Foreword


The  Thirteenth  International  Workshop  on  Principles  of  Diagnosis  (DX  02)  is  the 
latest  in  a  series  of  annual  workshops  that  focus  on  the  presentation  and  exchange 
of  current  results  in  the  field  of  diagnosis  and  related  areas,  including  tasks  such  as 
monitoring,  fault  identification  and  isolation,  testing,  reconfiguration  and  repair.  The 
workshops  are  historically  centered  on  approaches  from  the  Artificial  Intelligence  (AI) 
community, but aim at supporting a wide range of different techniques
and methodologies, as well as the integration of other research
communities such as Process Engineering and FDI.

The  papers  included  in  this  volume  span  a  wide  range  of  techniques  and  application 
areas,  including  such  domains  as  complex  hardware  systems,  software  and  knowledge 
bases,  secure  systems,  and  design  problems,  and  deal  with  discrete  and  continuous, 
algebraic,  logic-,  constraint-,  structure-,  and  probability-based  approaches,  dynamic 
and temporal systems, distribution and abstraction, and non-symbolic
methods of diagnosis. They bear witness to the continuing existence of
fertile ground for further theoretical and applied research.

The  invited  talks  continue  the  choice  of  earlier  workshops  to  bring  in  new  and 
varying  viewpoints  to  provide  a  wider  context  to  the  problem  area,  and  address  issues 
from  related  and  neighboring  areas  of  interest  to  the  diagnosis  community:  constraint 
satisfaction,  problem  decomposition,  and  debugging. 

We wish to thank the authors of the submitted papers; the program
committee members, at least two of whom reviewed each of the submitted
full papers, for the time and effort spent; and the invited speakers for
their participation. We especially wish to thank Sheila McIlraith for
her help in organizing the review process.

We  would  also  like  to  acknowledge  the  support  of  our  sponsors  for  their  contribu¬ 

Particle Filters for Real-Time Fault Detection in Planetary Rovers

Richard  Dearden  and  Dan  Clancy 
Research  Institute  for  Advanced  Computer  Science 
NASA  Ames  Research  Center 
Mail  Stop  269-3  Moffett  Field,  CA  94035  USA 
Email: {dearden, clancy}@ptolemy.arc.nasa.gov


Abstract. 

Planetary  rovers  provide  a  considerable  challenge  for  artificial  in¬ 
telligence  in  that  they  must  operate  for  long  periods  autonomously, 
or  with  relatively  little  intervention.  To  achieve  this,  they  need  to 
have  on-board  fault  detection  and  diagnosis  capabilities.  Traditional 
model-based  diagnosis  techniques  are  not  suitable  for  rovers  due  to 
the  tight  coupling  between  the  vehicle’s  performance  and  its  envi¬ 
ronment.  Hybrid  diagnosis  using  particle  filters  is  presented  as  an 
alternative,  and  its  strengths  and  weaknesses  are  examined.  We  also 
present  some  extensions  to  particle  filters  that  are  designed  to  make 
them  more  suitable  for  use  in  diagnosis  problems. 

1  Introduction 

Planetary  rovers  provide  a  considerable  challenge  for  artificial  intel¬ 
ligence  in  that  they  must  operate  for  long  periods  autonomously,  or 
with  relatively  little  intervention.  To  achieve  this,  they  need  (among 
other  things)  to  have  on-board  fault  detection  and  diagnosis  capabil¬ 
ities  in  order  to  determine  the  actual  state  of  the  vehicle,  and  decide 
what  actions  are  safe  to  perform.  However,  as  we  will  discuss  be¬ 
low,  traditional  approaches  to  diagnosis  are  unsuitable  for  rovers,  and 
we must turn to hybrid approaches. In this paper we describe an
approach to hybrid diagnosis based on particle filters [2, 7, 3]. We show
that  the  characteristics  of  diagnosis  problems  present  some  difficul¬ 
ties  for  standard  particle  filters,  and  describe  an  approach  for  solving 
this  problem.  We  will  use  rovers  as  a  motivating  example  through¬ 
out  this  paper,  but  the  techniques  we  describe  can  be  applied  to  any 
hybrid  diagnosis  problem. 

The  diagnosis  problem  is  to  determine  the  current  state  of  a  system 
given  a  stream  of  observations  of  that  system.  In  traditional  model- 
based  diagnosis  systems  such  as  Livingstone  [14],  diagnosis  is  per¬ 
formed  by  maintaining  a  set  of  candidate  hypotheses  (in  Livingstone, 
a  single  hypothesis  was  used)  about  the  current  state  of  the  system, 
and  using  the  model  to  predict  the  expected  future  state  of  the  system 
given  each  candidate.  The  predicted  states  are  then  compared  with 
the  observations  of  what  actually  occurred.  If  the  observations  are 
consistent  with  a  particular  state  that  is  predicted,  that  state  is  kept  as 
a  candidate  hypothesis.  If  they  are  inconsistent,  the  candidate  is  dis¬ 
carded.  Traditional  diagnosis  systems  typically  use  a  logic-based  rep¬ 
resentation,  and  use  monitors  to  translate  continuous-valued  sensor 
readings  into  discrete-valued  variables.  The  system  can  then  reason 
about  the  discrete  variables,  and  compare  them  with  the  predictions 
of  the  model  using  constraint  propagation  techniques. 


Unlike  the  spacecraft  domains  that  Livingstone  has  been  applied 
in,  rover  performance  depends  significantly  on  environmental  inter¬ 
actions.  The  on-board  sensors  provide  streams  of  continuous  valued 
data  that  varies  due  to  noise,  but  also  due  to  the  interaction  between 
the  rover  and  its  environment.  For  example,  a  rover  may  have  a  sen¬ 
sor  that  reports  the  current  drawn  by  a  wheel.  In  normal  operation, 
this  quantity  may  vary  considerably,  increasing  when  the  vehicle  is 
climbing  a  hill,  and  decreasing  on  downward  slopes.  The  diagnosis 
system  needs  to  be  able  to  distinguish  a  change  in  the  current  drawn 
due  to  the  terrain  being  traversed  from  a  change  due  to  a  fault  in  the 
wheel. A second issue for rovers is that their weight and power are
very  tightly  constrained.  For  this  reason,  any  on-board  diagnosis  sys¬ 
tem  must  be  computationally  efficient,  and  should  be  able  to  adapt 
to  variations  in  processor  availability.  Ideally,  we  would  also  like  it 
to  adapt  based  on  its  own  performance,  spending  more  time  on  diag¬ 
nosis  when  a  fault  is  likely  to  have  occurred,  and  less  time  when  the 
system  appears  to  be  operating  normally. 

A  rover’s  close  coupling  with  its  environment  poses  a  considerable 
problem  for  diagnosis  systems  that  use  discrete  models.  A  particular 
sensor  reading  may  be  normal  under  certain  environmental  condi¬ 
tions,  but  indicative  of  a  fault  in  others,  so  any  monitor  that  trans¬ 
lates  the  sensor  reading  into  a  discrete  value  such  as  “nominal,”  or 
“off-nominal  high”  must  be  sophisticated  enough  to  take  all  the  en¬ 
vironmental  conditions  into  account.  This  can  mean  that  the  diagno¬ 
sis  problem  is  effectively  passed  off  to  the  monitors — the  diagnosis 
system  is  very  simple,  but  relies  on  discrete  sensor  values  from  ex¬ 
tremely  complex  monitors  that  diagnose  the  interaction  between  the 
system  and  its  environment  as  part  of  translating  continuous  sensor 
values  into  discrete  variables.  To  overcome  this  problem,  we  need  to 
reason  directly  with  the  continuous  values  we  receive  from  sensors. 
That  is,  our  model  needs  to  be  a  hybrid  system,  consisting  of  a  set  of 
discrete modes that the system can be in, along with a set of continuous
state variables. The dynamics of the system are described in terms
of  a  set  of  equations  governing  the  evolution  of  the  state  variables, 
and  these  equations  will  be  different  in  different  modes.  In  addition, 
a  transition  function  describes  how  the  system  moves  from  one  mode 
to  another,  and  an  observation  function  defines  the  likelihood  of  an 
observation  given  the  mode  and  the  values  of  the  system  variables. 

This  hybrid  model  can  be  seen  as  a  partially  observable  Markov 
decision process (POMDP) [1]. POMDPs are frequently used as a
representation  for  decision-theoretic  planning  problems,  where  the 
task is to determine the best action to perform given the current
estimate of the actual state of the system. This estimate, referred to
as the belief state, is exactly what we would like to determine in the
diagnosis problem, and the problem of keeping the belief state updated
is  well  understood  in  the  decision  theory  literature.  The  belief  state 
is  a  probability  distribution  over  the  system  states — that  is,  for  every 
state  it  gives  the  probability  of  being  in  that  state,  given  our  prior  be¬ 
liefs  about  the  state  of  the  system,  and  the  sequence  of  observations 
and  actions  that  have  occurred  so  far. 
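
For reference, the standard recursive form of this update can be sketched as follows. The symbols b_t, o_t and the normalizer \eta are notation assumed here rather than taken from the text, and actions are omitted for brevity:

```latex
b_t(s') = \eta \, \Pr(o_t \mid s') \sum_{s} \Pr(s' \mid s) \, b_{t-1}(s),
\qquad
\eta^{-1} = \sum_{s'} \Pr(o_t \mid s') \sum_{s} \Pr(s' \mid s) \, b_{t-1}(s)
```

For a hybrid state, the sum over s becomes a sum over modes together with an integral over the continuous variables.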

Unfortunately,  maintaining  an  exact  belief  state  is  computation¬ 
ally  intractable  for  the  types  of  problem  we  are  interested  in.  Since 
our  model  contains  both  discrete  and  continuous  variables,  the  belief 
state  is  a  set  of  multidimensional  probability  distributions  over  the 
continuous  state  variables,  with  one  such  distribution  for  each  mode 
of  the  system.  These  distributions  may  not  even  be  unimodal,  so  just 
representing the belief state is a complex problem, but updating it
when new observations are made is intractable for all but the simplest
hybrid models (see [8] for an illustration of this).
Therefore,  an  approximation  needs  to  be  made.  As  we  said  above, 
we  will  use  a  particle  filter  to  approximate  the  belief  state  and  keep 
it  updated. 

A  particle  filter  represents  a  probability  distribution  using  a  set  of 
discrete  samples,  referred  to  as  particles,  each  of  which  has  an  associ¬ 
ated  weight.  The  set  of  weighted  particles  constitutes  an  approxima¬ 
tion  to  the  belief  state,  and  has  the  advantage  over  other  approaches 
such  as  Kalman  Filters  [6]  that  it  can  represent  arbitrary  distributions. 
To  update  the  distribution  when  a  new  observation  is  made,  we  treat 
each  particle  as  a  hypothesis  about  the  state  of  the  system,  apply  the 
model  to  it  to  move  it  to  a  new  state,  and  multiply  the  weight  of  the 
particle  by  the  likelihood  of  making  the  observation  in  that  new  state. 
To  prevent  a  small  number  of  particles  from  dominating  the  proba¬ 
bility  distribution,  the  particles  are  then  resampled,  with  a  new  set  of 
particles,  each  of  weight  one,  being  constructed  by  selecting  samples 
randomly  based  on  their  weight  from  the  old  set. 

Particle  filters  have  already  proven  very  successful  for  a  number  of 
tasks, including visual tracking [7] and robot navigation [4].
Unfortunately, they are less well suited to diagnosis tasks. This is
because the mode transitions that we are most interested in detecting,
namely transitions to fault states, typically have very low probability
of actually occurring. Thus, there is a risk that there will be no particle in a
fault  state  when  a  fault  occurs,  and  the  system  will  be  unable  to  diag¬ 
nose  the  fault.  We  propose  a  solution  to  this  problem  by  thinking  of  a 
particle  filter  as  a  convenient  way  to  divide  the  computation  time  that 
is  available  for  doing  diagnosis  between  the  candidate  states  that  the 
system  could  be  in.  A  conventional  particle  filter  splits  the  particles 
(and  hence  the  computation  time)  according  to  how  well  the  states 
predict  the  observations,  but  with  this  approach  we  will  also  spend 
some  computation  time  on  fault  states  that  are  important  to  diagnose. 
We  do  this  simply  by  ensuring  that  there  are  always  some  particles 
in  those  states.  As  we  will  show,  the  details  of  the  particle  filter  algo¬ 
rithm  mean  that  we  can  add  these  additional  particles  without  biasing 
the  diagnosis  that  results. 

In  the  next  section  we  discuss  the  hybrid  model  of  the  rover  in 
detail.  In  Section  3  we  describe  particle  filtering  and  demonstrate  its 
weaknesses  when  applied  to  diagnosis  problems,  and  in  Section  4 
we  will  describe  our  modifications  to  the  standard  particle  filter  in 
detail.  In  Section  5  we  present  some  preliminary  results  on  real  rover 
data,  using  a  simple  version  of  our  proposed  approach.  The  final  sec¬ 
tion  looks  at  the  relationship  between  this  work  and  some  previous 
approaches  to  this  problem,  and  discusses  some  future  directions  for 
this  work. 


2  Modeling  a  Planetary  Rover 

As  we  said  above,  we  model  a  rover  as  a  hybrid  system.  The  dis¬ 
crete  component  of  the  rover’s  state  represents  the  various  opera¬ 
tional  and  fault  modes  of  the  rover,  while  the  continuous  state  de¬ 
scribes  the  speed  of  the  wheels,  the  current  being  drawn  by  various 
subsystems,  and  so  on.  Following  [13],  our  rover  model  consists  of  a 
tuple (M, V, T, E, O) where the elements of the tuple are as follows:

•  M  is  the  set  of  discrete  modes  the  system  can  be  in.  We  assume 
that  M  is  finite,  and  write  m  for  an  individual  system  mode. 

•  V  is  the  set  of  variables  describing  the  continuous  state  of  the 
system. 

•  T is a transition function that defines how the system moves from
one mode to another over time. We write Pr_T(m, m') for the
probability that the system moves from mode m to mode m'.
We may also include a second transition function Pr_T(m, a, m'),
which is used when an action a occurs. This gives the probability
of moving from m to m' when action a is executed.

•  E  is  a  set  of  equations  that  describe  the  evolution  of  the  continu¬ 
ous  variables  over  time.  The  equations  that  apply  at  a  given  time 
potentially depend on the system mode, so we write E_m for the
equations  that  apply  in  mode  m.  These  equations  will  in  general 
include  a  noise  term  to  account  for  random  variations  in  the  state 
variables.  Here  we  will  assume  Gaussian  noise,  with  the  parame¬ 
ters  of  the  Gaussian  determined  individually  for  each  equation. 

•  O  is  a  function  mapping  the  system  state  into  observations.  We 
will  assume  that  the  observable  system  characteristics  are  some 
subset  of  the  system  variables  V,  with  their  values  corrupted  by 
Gaussian  noise  (again  with  parameters  that  may  be  a  function  of 
the variable, and the system mode), so we write O(v, m) for the
observed  value  of  some  variable  v  in  mode  m. 

We will also write Pr(s' | s) for the probability distribution over
future states s' given some state s, where s and s' are hybrid states,
so Pr(s' | s) includes both the distribution over the future mode given
by Pr_T(m' | m), and the distributions over the continuous variables
given by E_m.
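
As a minimal sketch of how a hybrid state might be advanced by sampling from Pr(s' | s), consider the following. The two modes, the probabilities, the drift and noise values, and all names are hypothetical; applying the equations of the newly sampled mode is also an assumption of this sketch.

```python
import random
from dataclasses import dataclass


@dataclass
class HybridState:
    mode: str      # discrete mode m
    values: dict   # continuous variables in V, name -> value


# Hypothetical Pr_T(m, m'); the numbers are illustrative only.
TRANSITIONS = {
    "IDLE": {"IDLE": 0.99, "FAULT": 0.01},
    "FAULT": {"FAULT": 1.0},
}

# Hypothetical E_m: per mode and variable, (constant term, noise std dev).
EQUATIONS = {
    "IDLE": {"current": (0.0, 0.05)},
    "FAULT": {"current": (-0.1, 0.05)},
}


def sample_next_state(state, rng):
    """Draw s' ~ Pr(s' | s): sample the mode from Pr_T, then apply E_m'."""
    successors = TRANSITIONS[state.mode]
    modes = list(successors)
    new_mode = rng.choices(modes, weights=[successors[m] for m in modes])[0]
    new_values = {}
    for name, value in state.values.items():
        drift, sigma = EQUATIONS[new_mode][name]
        # Previous value plus a constant term plus Gaussian noise.
        new_values[name] = value + drift + rng.gauss(0.0, sigma)
    return HybridState(new_mode, new_values)
```

Calling `sample_next_state(HybridState("IDLE", {"current": 1.0}), random.Random(0))` returns a new state and leaves the input unchanged.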

The  diagnosis  problem  now  becomes  the  task  of  determining  the 
current  mode  m  that  the  system  is  in,  and  the  values  of  all  the  state 
variables  in  V  (the  results  we  will  present  will  only  show  the  proba¬ 
bility  distribution  over  discrete  modes,  but  the  algorithm  produces  a 
distribution  over  the  full  hybrid  state). 

The  experiments  we  will  present  in  Sections  3  and  5  use  actual 
telemetry data from the NASA Ames Marsokhod rover. The Marsokhod
is a planetary rover built on a Russian chassis that was used in
field tests from 1993 to 1999 in Russia, Hawaii, and the deserts of
Arizona and California. The rover has six independently driven wheels,
and for the experiments we present here, the right rear wheel had a
broken gear, and so rolled passively. The Marsokhod has a number of
sensors,  but  we  will  restrict  our  attention  to  diagnosing  the  state  of 
the  broken  wheel,  and  will  therefore  use  only  data  from  the  wheel 
current  and  wheel  odometry  sensors.  We  will  treat  each  wheel  inde¬ 
pendently  in  the  diagnosis.  For  each  wheel,  we  have  a  model,  taken 
from  [13],  with  the  following  characteristics: 

•  M  consists  of  23  system  modes  of  which  14  are  fault  states. 

•  V  consists  of  variables  for  the  wheel  current  and  wheel  speed,  and 
the  derivatives  of  current  and  speed. 

•  T is a fairly sparse matrix, with at most six successors for any
given mode. The probability of a transition to a fault state is 0.01
or less. All commands are described by one transition function
for the start and one for the end of a command because the data
does not identify which command occurred.

•  The  state  equations  in  E  consist  of  the  previous  value  plus  a  con¬ 
stant  term  and  noise.  The  noise  is  Gaussian  with  standard  devia¬ 
tion  in  the  range  0.001  to  1.0,  and  the  equations  are  independent 
for  each  state  variable. 

•  The  equations  in  O  are  independent  for  each  variable  (but  vary 
depending  on  the  mode),  and  include  a  Gaussian  noise  term  with 
a  standard  deviation  that  varies  from  0.001  to  1.0. 

3  Particle  Filters 

A  particle  filter  approximates  an  unknown  probability  distribution 
using  a  weighted  set  of  samples.  Each  sample  or  particle  consists  of 
a  value  for  every  state  variable,  so  it  describes  one  possible  complete 
state  the  system  might  be  in.  As  observations  are  made,  the  transition 
function  is  applied  to  each  particle  individually,  moving  it  stochasti¬ 
cally  to  a  new  state,  and  then  the  observations  are  used  to  re-weight 
each  particle  to  reflect  the  likelihood  of  the  observation  given  the 
particle’s  new  state.  In  this  way,  particles  that  predict  the  observed 
system  performance  are  highly  weighted,  indicating  that  they  are  in 
likely  states  of  the  system.  A  major  advantage  of  particle  filters  is 
that  their  computational  requirements  depend  only  on  the  number  of 
particles, not on the complexity of the model. This is of huge
importance to us as it allows us to do diagnosis in an anytime fashion,
increasing or decreasing the number of particles depending on the
available computation time.

To  implement  a  particle  filter,  we  require  three  things: 

•  A  probability  distribution  over  the  initial  state  of  the  system. 

•  A  model  of  the  system  that  can  be  used  to  predict,  given  the  cur¬ 
rent  state  according  to  an  individual  particle,  a  possible  future  state 
of  that  particle.  Since  T  is  stochastic,  and  E  includes  noise  terms, 
the  predictive  model  selects  a  new  state  for  the  particle  in  a  Monte 
Carlo  fashion  [5],  choosing  by  sampling  from  the  probability  dis¬ 
tribution  over  possible  future  states. 

•  A  way  to  compute  the  likelihood  of  observing  particular  sensor 
values  given  a  state.  In  our  case,  this  is  given  by  the  observation 
function  O. 

The  particle  filtering  algorithm  is  given  in  Figure  1.  Step  (i)  is  the 
predictive step, where a new state is calculated in a Monte Carlo way
for  each  particle,  and  this  new  state  is  then  conditioned  on  the  obser¬ 
vations  in  step  (ii)  (we  call  this  the  re-weighting  step).  The  predictive 
step  is  performed  by  applying  T  to  each  particle,  and  then  apply¬ 
ing  the  appropriate  equations  from  E  to  the  state  variables,  sampling 
values  from  the  Gaussian  error  terms.  Once  the  particles  have  been 
re-weighted,  we  can  then  calculate  the  probability  of  each  mode  sim¬ 
ply  by  summing  the  weights  of  the  particles  in  the  mode.  We  refer  to 
step  (b)  as  the  resampling  step.  For  more  details  on  the  properties  of 
particle filters see e.g. [3].
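
The mode probabilities mentioned above can be sketched as follows. The function name and the (mode, weight) representation are assumptions of this sketch, as is the normalization by the total weight, which turns the summed weights into probabilities.

```python
from collections import defaultdict


def mode_probabilities(particles):
    """particles: iterable of (mode, weight) pairs.

    Returns a dict mapping each mode to the summed weight of its
    particles, divided by the total weight of all particles.
    """
    totals = defaultdict(float)
    for mode, weight in particles:
        totals[mode] += weight
    total_weight = sum(totals.values())
    if total_weight == 0:
        raise ValueError("all particle weights are zero")
    return {mode: w / total_weight for mode, w in totals.items()}
```

A mode that holds no particles does not appear in the result at all.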

3.1  Problems  with  Particle  Filters  for  Diagnosis 

Unfortunately,  there  are  a  number  of  difficulties  in  applying  particle 
filters  to  diagnosis  problems.  In  particular,  the  filter  must  have  a  par¬ 
ticle  in  a  particular  state  before  the  probability  of  that  state  can  be 
evaluated.  If  a  state  has  no  particles  in  it,  its  probability  of  being  the 
true  state  of  the  system  is  zero.  This  is  a  particular  problem  in  diag¬ 
nosis  problems  because  the  transition  probabilities  to  fault  states  are 
typically very low, so particles are unlikely to end up in fault states
during the Monte Carlo predictive step. This situation is known as
sample impoverishment.

1. Create a set of n particles where each particle p_i has a state s_i
   and a weight w_i. s_i is sampled randomly from the prior state
   distribution, and w_i = 1.

2. For each time step, do:

   (a) Replace each particle p_i with p'_i as follows:

       i. Select a future state s'_i by sampling from Pr(s'_i | s_i), the
          distribution over possible future states given by the model.

       ii. Re-weight p_i by multiplying its weight by the probability
           of the observations o given s'_i as follows:

           w'_i = Pr(o | s'_i) w_i

   (b) Resample n new particles p_1, ..., p_n by copying the current
       particles p', where each particle p'_j is added to the new
       samples with the following probability:

           Pr(p_i = p'_j) = w'_j / \sum_{k=1}^{n} w'_k

Figure 1. The particle filtering algorithm.
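
One pass of step 2 of the algorithm in Figure 1 can be sketched as below. This is an illustrative sketch, not the implementation used for the experiments: the function names, the (state, weight) representation, and the use of a Gaussian density as the observation likelihood are assumptions.

```python
import math
import random


def gaussian_pdf(x, mean, sigma):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def particle_filter_step(particles, observation, predict, likelihood, rng):
    """One time step of the particle filter.

    particles: list of (state, weight) pairs.
    predict(state, rng): samples a future state s' from Pr(s' | s).
    likelihood(observation, state): returns Pr(o | s').
    """
    # (a) Predictive step followed by the re-weighting step.
    moved = []
    for state, weight in particles:
        new_state = predict(state, rng)
        moved.append((new_state, weight * likelihood(observation, new_state)))

    total = sum(w for _, w in moved)
    if total == 0.0:
        raise ValueError("no particle explains the observation")

    # (b) Resampling: draw n particles with probability w'_j / sum_k w'_k.
    states = [s for s, _ in moved]
    weights = [w for _, w in moved]
    resampled = rng.choices(states, weights=weights, k=len(particles))
    return [(s, 1.0) for s in resampled]
```

For a toy one-dimensional state, `predict` could be `lambda x, rng: x + rng.gauss(0, 0.1)` and `likelihood` could be `lambda o, x: gaussian_pdf(o, x, 0.2)`. Resampled particles share references to their states, so mutable states should be copied before being modified.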

Figure  2  illustrates  this  problem.  Each  graph  shows  the  most  likely 
modes  that  the  wheel  is  in  (the  y-axis  is  the  total  weight  of  the  parti¬ 
cles  in  each  mode,  so  a  value  of  10000  implies  that  all  particles  are  in 
that  mode),  shown  over  part  of  one  of  the  trials  in  which  the  wheel  is 
initially  idle,  and  then  at  step  12  is  commanded  to  drive  forward  at  a 
fixed  speed.  The  graphs  on  the  left  show  the  performance  of  Wheel  1, 
which is operating nominally. The graphs on the right show the
performance of Wheel 6, which is faulty. In the top row, the probability
of the fault occurring is 0.1 rather than its true value of 0.01. Here
the fault is quickly detected in Wheel 6. In the bottom row of graphs,
the  fault  probability  is  set  to  its  true  value,  and  in  this  case  the  fault 
is  not  successfully  detected  because  insufficient  particles  enter  the 
fault  state.  One  might  expect  that  once  a  particle  enters  the  fault  state 
its  weight  would  be  high  since  it  would  predict  well,  and  at  the  re¬ 
sampling  step  it  should  lead  to  several  new  particles  being  created. 
Unfortunately,  this  did  not  occur  in  this  situation  because  although 
some  particles  did  enter  the  fault  state,  their  continuous  parameter 
values  did  not  agree  with  the  observations  well,  so  they  still  had  low 
weights.  The  continuous  parameters  did  not  match  because  each  of 
the particles that entered the fault state came from the
COMMANDEDRUNNING state, in which the current and wheel speed are
expected to be much higher than the observed values.

Figure 2. Results for Wheel 1 (left side) and Wheel 6 (right side). In
the top row, the probability of a fault is ten times its true value,
while the bottom row uses the true probabilities. The fault
(GearAndEncoderFaultRunning) in Wheel 6 is quickly detected in the top
row, but is never discovered in the bottom row due to sample
impoverishment.

4  Importance Sampling

The simplest solution to the sample impoverishment problem is to
increase the number of particles being used. Given the constraints
imposed on on-board systems, this approach is probably unrealistic.
The particle filter that produced the data presented above was
implemented in Java, using 10,000 particles per wheel, and runs in
approximately 0.5 seconds per update on a 750MHz Pentium 3. This is
probably at the upper limit of the number of particles we could expect
to use on-board a rover: the time available for diagnosis is longer,
but there will be less computation devoted to diagnosis. Thus running
with ten times as many particles (which is roughly equivalent to
multiplying the fault probability by ten) is probably impractical on
the rover, and even
10,000  particles  may  be  unrealistic  as  the  model  gets  more  complex. 
This  could  be  somewhat  overcome  by  only  increasing  the  number 
of  particles  when  there  is  some  evidence  that  the  system  is  predict¬ 
ing  poorly.  In  order  to  achieve  this,  we  need  some  measure  of  when 
this  occurs.  The  obvious  measure  is  to  look  at  the  total  weight  of  the 
particles  after  conditioning  on  the  observations.  If  no  particles  are 
predicting  the  observations  well  the  total  weight  should  drop.  Un¬ 
fortunately,  in  practice  this  is  rarely  useful  because  there  are  a  num¬ 
ber  of  other  possible  causes  for  this  behavior.  For  example,  particles 
moving  from  a  state  in  which  there  is  high  confidence  in  the  sen¬ 
sor  readings  to  a  state  with  more  sensor  noise  will  tend  to  drop  in 
weight  even  if  they  are  still  predicting  the  observations  well.  We  see 
this  in  the  Marsokhod  model  because  the  IDLE  mode  has  relatively 
large variance for the observation noise, whereas the COMMANDEDRUNNING
mode has smaller variance, so the total particle weight
increases when the system moves from the IDLE to the COMMANDEDRUNNING
mode, even for wheel 6 where COMMANDEDRUNNING
predicts  the  observations  poorly. 
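
A small numerical illustration of this effect, assuming Gaussian observation noise and treating the density as the particle weight (the standard deviations and prediction errors below are hypothetical, not the Marsokhod model's values):

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# A particle that predicts the observation exactly, in a mode with
# large observation noise (sigma = 1.0).
perfect_high_noise = gaussian_pdf(0.0, 0.0, 1.0)

# A particle that predicts the observation exactly, in a mode with
# small observation noise (sigma = 0.1).
perfect_low_noise = gaussian_pdf(0.0, 0.0, 0.1)

# A particle that is off by two standard deviations in the low-noise mode.
poor_low_noise = gaussian_pdf(0.2, 0.0, 0.1)
```

Here the poorly predicting particle in the low-noise mode still receives a larger weight than the perfectly predicting particle in the high-noise mode, so the total weight alone does not indicate prediction quality.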

Another  way  to  reduce  the  likelihood  of  sample  impoverishment 
is  to  take  advantage  of  some  results  from  importance  sampling  (see 
e.g. [9]). In importance sampling, we want to sample from some
distribution P, but we can't. Instead, we sample from some other
distribution Q, and weight each sample s by Pr_P(s) / Pr_Q(s), the ratio
of the likelihood of sampling s from P to the likelihood of sampling
it from Q. The weighted sample is then an unbiased sample from
P, as long as Q is non-zero everywhere that P is non-zero. In fact,
importance  sampling  is  exactly  what  the  particle  filter  algorithm  is 
doing.  For  a  particle  filter,  the  unknown  distribution  P  is  the  poste¬ 
rior  distribution  we  are  trying  to  compute.  Q  is  the  prior  distribution 
(the  set  of  samples  before  the  observation),  and  the  re-weighting  step 
corresponds  to  the  importance  sampling  weight  computation. 
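
The weighting scheme can be illustrated with a small standalone sketch. The choice of P as a standard Gaussian and Q as a wider Gaussian is purely an assumption for illustration, as are the function names.

```python
import math
import random


def normal_pdf(x, mean, sigma):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def importance_estimate(f, n, rng):
    """Estimate the expectation of f under P = N(0, 1) using samples
    drawn from Q = N(0, 2), each weighted by Pr_P(x) / Pr_Q(x)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)                                 # sample from Q
        w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)   # Pr_P / Pr_Q
        total += w * f(x)
    return total / n
```

With `f = lambda x: x * x` the estimate approaches 1, the variance of P, even though no sample was drawn from P itself. Q is non-zero wherever P is non-zero, as the condition above requires.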

Given  that  whatever  we  choose  for  Q,  the  weighted  samples  are 
an  unbiased  sample  from  P,  we  can  add  arbitrary  samples  to  our 
particle  filter  at  the  resampling  step,  and  still  end  up  with  an  unbiased 
posterior  distribution.  We  will  use  this  property  to  ensure  that  we 
have  samples  in  the  system  modes  that  are  important  to  us  (hence 
the name importance sampling). The question then is how to choose
Q. We can imagine an oracle that provides a set of candidate states
the system might end up in, given the current distribution over states.
When  we  resample,  we  simply  make  sure  that  there  are  always  some 
particles  in  the  states  provided  by  the  oracle.  If  those  states  explain 
the  subsequent  observations  well,  the  particles  in  them  will  get  high 
weight, and are likely to be resampled with more particles at the next
step. On the other hand, if they predict the observations poorly, the
particles will quickly disappear again.

Figure 3. Results for the importance sampled particle filter. All states
with > 25% probability were used as starting points for the forward
search, and 0.5% of the particles were assigned to each of the found
states. On the left are results for wheel 1, and on the right for wheel 6.
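
A resampling step that always keeps some particles in the oracle's states might be sketched as follows. This is a sketch under stated assumptions, not the method used for the experiments: the text does not specify how the added particles obtain their continuous state or weight, so the hypothetical `make_values` callback and the unit weight given to added particles are assumptions, as is reserving a fixed quota of slots per oracle mode.

```python
import random


def resample_with_oracle(particles, oracle_modes, make_values, rng,
                         min_fraction=0.005):
    """particles: list of (mode, values, weight) triples.

    Returns n unit-weight particles. Most are drawn by ordinary weighted
    resampling; a quota of slots is reserved for each oracle mode and
    filled with particles whose continuous values come from
    make_values(mode, rng).
    """
    n = len(particles)
    quota = max(1, int(min_fraction * n))
    reserved = quota * len(oracle_modes)
    if reserved > n:
        raise ValueError("too many oracle modes for this particle count")

    weights = [w for _, _, w in particles]
    base = rng.choices(particles, weights=weights, k=n - reserved)
    new_particles = [(mode, values, 1.0) for mode, values, _ in base]

    # Guarantee a minimum number of particles in each oracle mode.
    for mode in oracle_modes:
        for _ in range(quota):
            new_particles.append((mode, make_values(mode, rng), 1.0))
    return new_particles
```

With 1,000 particles and the default `min_fraction`, each oracle mode receives at least 5 particles after every resampling step.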

The  question  that  remains  is  how  to  implement  the  oracle.  For  a 
complex  system  such  as  a  planetary  rover,  with  many  components 
each  with  its  own  set  of  possible  failure  modes,  there  are  exponen¬ 
tially  many  possible  failure  modes,  so  this  is  a  non-trivial  problem. 
However,  one  approach  that  seems  promising  is  to  use  a  traditional 
model-based diagnosis system such as Livingstone [14]. These
discrete systems typically operate more quickly than hybrid approaches,
so  can  be  used  to  suggest  hypotheses  without  significantly  affecting 
computation  time.  We  pointed  out  in  the  introduction  that  they  are  not 
in  general  suitable  for  diagnosing  rovers,  but  they  could  be  used  to 
identify sets of likely system modes for the hybrid system to
consider. The integration of Livingstone with the particle filter approach is
currently  work  in  progress,  as  it  adds  a  number  of  additional  compli¬ 
cations  including  building  an  additional  system  model,  and  ensuring 
that  the  discrete  and  hybrid  models  agree  with  one  another  and  can 
easily  be  translated  back  and  forth. 

For  simpler  systems  such  as  the  Marsokhod  wheel  diagnosis  we 
have  used  in  this  paper,  the  above  approach  is  unnecessary.  Instead, 
we  can  use  an  oracle  based  on  forward  search  from  the  current  high- 
probability  states.  Since  each  system  mode  in  this  model  has  at  most 
six  possible  successors,  and  there  are  typically  only  two  to  three  high 
probability  modes  at  any  time,  we  find  in  practice  that  in  most  cases 
a  simple  one-step  look-ahead  search  adds  fewer  than  five  modes  to 
those  that  already  contain  particles. 
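
A one-step look-ahead oracle of this kind can be sketched as below. The function name, the dictionary representation of the successor relation, and the mode names in the usage example are hypothetical; the default threshold of 0.25 is the value used for the results in Section 5.

```python
def forward_search_oracle(mode_probs, successors, threshold=0.25):
    """Return every mode reachable in one transition from a mode whose
    current probability exceeds the threshold.

    mode_probs: dict mapping mode -> current probability.
    successors: dict mapping mode -> iterable of possible next modes.
    """
    candidates = set()
    for mode, prob in mode_probs.items():
        if prob > threshold:
            candidates.update(successors.get(mode, ()))
    return sorted(candidates)
```

For example, with `successors = {"IDLE": ["IDLE", "RUNNING", "FAULT_A"]}` and `mode_probs = {"IDLE": 0.9, "RUNNING": 0.1}`, the oracle returns `["FAULT_A", "IDLE", "RUNNING"]`, so the fault mode is proposed even though no particle currently occupies it.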

5  Results 

The  results  we  present  here  are  based  on  the  Marsokhod  model  we 
described  in  Section  2.  Dr.  Rich  Washington  supplied  the  model  and 
the  data,  which  came  from  his  work  on  using  Kalman  filters  for  rover 
diagnosis  [13].  The  only  changes  made  to  the  model  were  to  make 
it  suitable  for  use  with  a  particle  filter;  no  changes  were  made  to 
model parameters or transition probabilities. To demonstrate our approach
we use a small piece of one of the telemetry data files (the
same piece used in Section 3) in which the rover is initially idle, and
then a drive command is issued, resulting in an increase in current to
each wheel, followed by a corresponding increase in speed, and then
a constant speed. As before, wheel 6 is faulty, with a broken gear
(this corresponds to the GearAndEncoderFaultRunning state
in the model).

Figure  3  shows  the  results  for  the  importance  sampled  particle  fil¬ 
ter.  We  used  single  step  forward  search  from  all  states  with  probabil¬ 
ity  >  0.25  to  select  the  set  of  bias  states.  Each  of  these  states  was 
then  guaranteed  to  receive  at  least  0.5%  of  the  total  number  of  par¬ 
ticles  at  each  re-sampling  step.  The  left  hand  graph  is  the  probable 
states  for  Wheel  1,  as  before.  Like  the  graph  in  the  bottom  row  of 
Figure  2,  the  PrematureAction  state  was  given  high  probability 
before  step  13.  This  state  appears  where  the  effects  of  an  action  are 
seen  before  the  signal  to  perform  the  action,  due  to  problems  with  the 
rover  telemetry.  In  this  case  it  is  a  spurious  result  due  to  the  model 
of  the  IDLE  state  not  allowing  sufficient  noise  in  the  observations.  A 
small  adjustment  to  the  model  would  remove  this  problem,  which  is 
only  present  in  the  data  for  two  of  the  wheels.  The  right  hand  graph 
shows  the  same  data  for  Wheel  6.  In  this  case,  the  fault  state  domi¬ 
nates  the  probability  distribution  after  step  20,  seven  steps  after  the 
command  to  drive  the  wheel  was  observed,  as  compared  with  three 
steps  for  the  model  with  increased  fault  probabilities  (Figure  2). 
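
The guaranteed minimum share of particles for the bias states can be illustrated with a small allocation sketch. It is only an approximation of the idea: the function name and inputs are hypothetical, the totals are not renormalized, and the importance weights that a real filter needs to compensate for the extra particles are deliberately left out.

```python
import math


def allocate_with_floor(mode_probs, bias_modes, n_particles, min_share=0.005):
    """Return particle counts per mode, with a floor for every bias mode.

    mode_probs:  dict mapping mode name -> posterior probability
    bias_modes:  modes selected by the oracle (e.g. by one-step search)
    min_share:   minimum fraction of n_particles given to each bias mode
    """
    floor = math.ceil(min_share * n_particles)
    # Start from counts proportional to the posterior probabilities.
    counts = {m: int(round(p * n_particles)) for m, p in mode_probs.items()}
    # Raise every bias mode to at least the floor, even if it had no particles.
    for mode in bias_modes:
        counts[mode] = max(counts.get(mode, 0), floor)
    # Note: the total may now exceed n_particles; this sketch does not rescale.
    return counts


counts = allocate_with_floor(
    {"Idle": 0.999, "Driving": 0.001},
    ["Driving", "GearAndEncoderFaultRunning"],
    n_particles=1000,
)
```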

6  Discussion  and  Relation  to  Other  Work 

An  important  thing  to  note  is  that  standard  particle  filters  treat  the 
model  essentially  as  a  black  box,  using  it  only  to  predict  future  states 
of  the  system.  We  have  described  one  approach  that  exploits  the 
structure  in  the  model  to  make  diagnosis  using  particle  filters  more 
effective.  This  is  by  no  means  the  only  way  to  exploit  that  structure, 
and  we  are  in  the  process  of  looking  at  other  techniques.  Possibilities 
include  making  a  single-fault  assumption  (but  relaxing  it  if  it  predicts 
the  observations  poorly),  and  taking  advantage  of  independence  be¬ 
tween  components  in  the  system  to  reduce  the  number  of  samples 
needed,  or  even  to  diagnose  subsystems  independently. 

One  closely  related  piece  of  work  is  Verma  et  al.’s  decision- 
theoretic  particle  filter  [11].  Their  approach  is  similar  to  ours,  but 
they assign a utility (which corresponds to how important each state
is to diagnose) to every state and multiply the probability of a transition
by the utility of the state that results. This alters the transition
function to favor important states, rather like the approach we took
in Figure 2. For relatively simple diagnosis tasks such as the one
we have presented here, the approaches seem very similar. However,
designing a utility function to produce the right diagnoses, without
causing too many false diagnoses of faults, is a difficult task, especially
as any reasonable utility function would give all fault states
a  high  utility.  In  [10],  they  refine  this  approach,  again  using  a  risk 
function  that  scores  states  by  how  important  it  is  to  diagnose  them 
correctly,  but  this  time  modifying  the  particle  filter  algorithm  so  that 
the  samples  are  distributed  according  to  the  product  of  the  posterior 
probability  distribution  and  the  risk  factor.  This  ensures  that  samples 
in  high-risk  states  have  higher  weights,  and  the  true  posterior  can  be 
recovered  from  the  risk-sensitive  posterior,  but  still  suffers  from  the 
problem  of  sample  impoverishment.  These  approaches  are  orthogo¬ 
nal  to  the  approach  described  here,  and  we  are  currently  working  on 
combining  the  two. 
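
The utility-weighted transition function attributed to [11] above can be pictured with the following sketch. It is an assumption-laden illustration rather than the method of [11]: the function name and example numbers are hypothetical, and the renormalization step is added here so that the result is again a probability distribution.

```python
def utility_weighted_transitions(transition_probs, utility):
    """Scale each transition probability by the utility of the resulting state.

    transition_probs: dict mapping successor state -> transition probability
    utility:          dict mapping state -> utility (states not listed get 1.0)
    """
    scaled = {s: p * utility.get(s, 1.0) for s, p in transition_probs.items()}
    total = sum(scaled.values())
    # Renormalize so the biased values sum to one (an assumption of this sketch).
    return {s: v / total for s, v in scaled.items()}


biased = utility_weighted_transitions(
    {"Nominal": 0.99, "Fault": 0.01},
    {"Nominal": 1.0, "Fault": 10.0},
)
```

Giving every fault state a high utility raises all fault probabilities together, which is the source of the false-diagnosis concern raised above.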

Another related effort is the work of Washington [13], which applies
Kalman filters to this problem. In this work, the continuous dynamics
in each mode are tracked by a set of Kalman filters. The main
problem  with  the  approach  is  that  the  number  of  filters  tends  to  in¬ 
crease  over  time  because  each  time  a  transition  is  made  to  a  state  the 
initial  conditions  for  the  filter  are  different,  and  filters  with  differ¬ 
ent  initial  conditions  cannot  be  combined.  This  is  not  a  problem  for 
particle  filter-based  approaches  because  the  particle  filters  can  rep¬ 
resent  arbitrary  distributions  over  the  parameter  values,  so  particles 
entering  a  state  with  two  different  sets  of  initial  conditions  will  form 
a bi-modal distribution. As we said above, we used the model and
data from that paper in our own work. We see fewer errors in the
mode  identification  with  our  approach  than  in  Washington’s  paper, 
although  we  are  sometimes  slower  to  identify  the  fault,  and  our  com¬ 
putational  requirements  are  somewhat  higher. 

As  we  said  in  the  introduction,  this  paper  is  intended  as  a  proof 
of  concept.  There  is  still  much  work  to  do  on  the  problem  of  how 
to  integrate  a  model  from  Livingstone  with  this  system  to  act  as  an 
oracle. We have demonstrated that a simple look-ahead search performs
quite well, but one-step look-ahead is clearly inadequate for large
diagnosis problems: most faults can occur at any time, so the approach
is unlikely to scale to very large problems. We are
also  examining  a  number  of  other  approaches  to  improving  diagno¬ 
sis  with  particle  filters,  such  as  backtracking  when  prediction  is  poor, 
and  re-sampling  past  states  based  on  observations  that  occurred  more 
recently.  Finally,  we  are  investigating  how  a  diagnosis  system  of  this 
type  would  fit  with  the  CLARAty  rover  architecture  [12]  being  used 
for  future  NASA  missions  to  Mars. 

Abstract

Applying  model-based  diagnosis  techniques  to  sys¬ 
tems  that  exhibit  hybrid  behavior  presents  an  inter¬ 
esting  set  of  challenges  that  mostly  revolve  around 
interactions  of  the  continuous  and  discrete  compo¬ 
nents  of  the  system.  In  many  real  world  systems, 
the  overall  physical  plant  is  inherently  continuous, 
but  system  control  is  performed  by  a  supervisory 
controller  that  imposes  discrete  switching  behav¬ 
iors  by  reconfiguring  the  system  components,  or 
switching  controllers.  In  this  paper,  we  present  a 
case  study  of  an  aircraft  fuel  system,  and  discuss 
methodologies  for  building  system  models  for  on¬ 
line  tracking  of  system  behavior  and  performing 
fault  isolation  and  identification.  Empirical  stud¬ 
ies  are  performed  on  detection  and  isolation  for  a 
set  of  pump  and  pipe  failures. 

1  Introduction 

Most  present-day  systems  that  we  use  are  designed  to  be  re¬ 
pairable.  Failures,  either  physical  (hardware)  or  logical  (soft¬ 
ware),  and  the  resulting  maintenance  are  a  fundamental  part 
of  the  economics  of  ownership.  Fault  diagnosis  involves  the 
detection  of  anomalous  system  behavior  and  the  isolation  and 
identification  of  the  cause  for  the  deviant  behavior.  When  the 
system  includes  safety-critical  components,  failures  or  faults 
in  the  system  must  be  diagnosed  as  quickly  as  possible,  and 
their  effects  compensated  for  so  that  control  and  safety  can 
be maintained. The term diagnostic capabilities refers to
the ability of a system to detect a failure and isolate it to
a failed unit. Quick detection and isolation allow quick
corrective actions that may include reconfiguration of system
functions to prevent damage and maintain control.

Fault  accommodation  requires  tight  integration  of  online 
fault  detection,  isolation,  and  identification  with  the  system 
control  loop  that  may  be  designed  to  take  appropriate  control 
actions  to  mitigate  the  effect  of  the  faults  and  help  maintain 
nominal system operation. Failure to detect faults reduces
system availability, results in failed or incomplete missions,
and, in some cases, may even cause catastrophic failures that
lead to loss and destruction of the system. Therefore, fault
diagnosis is critical to achieving system performance and life-cycle
cost objectives.

In general, systems are dynamic, i.e., their behavior
changes over time. Faults impose additional transients on
the dynamic behavior, but these may be difficult to detect and
characterize, especially in the presence of model disturbances
and noisy measurements. Moreover, in physical systems natural
feedback from the system and controller actions may
mask the transients if they are not detected soon after
they occur. This motivates the development and use of online
model-based fault detection and isolation methods. Model-based
techniques employ a model to predict nominal system
behavior.  The  model  must  be  constructed  at  a  level  of  detail 
where  system  behavior  can  be  mapped  to  system  components 
and  parameters.  The  relations  in  the  model  are  employed  to 
map  observed  deviations  between  measurements  and  values 
predicted  by  the  model  to  possible  faults  in  system  compo¬ 
nents.  Continued  monitoring  helps  establish  a  unique  fault  or 
set  of  faults  associated  with  the  system. 
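
The comparison of measurements against model predictions can be sketched as a simple residual check. This is only an illustrative detector: the threshold, the requirement of several consecutive deviations, and the function name are assumptions, not a scheme described in this paper.

```python
def detect_deviation(measured, predicted, threshold, min_consecutive=3):
    """Return the sample index at which a persistent deviation is confirmed.

    A deviation is confirmed once |measured - predicted| has exceeded
    `threshold` for `min_consecutive` samples in a row; isolated noisy
    samples reset the count. Returns None if no deviation is confirmed.
    """
    run = 0
    for k, (y, y_hat) in enumerate(zip(measured, predicted)):
        if abs(y - y_hat) > threshold:
            run += 1
            if run >= min_consecutive:
                return k
        else:
            run = 0
    return None


# Hypothetical tank-level readings drifting away from a constant prediction.
index = detect_deviation([10, 10, 10, 9, 8, 7, 6], [10] * 7, threshold=0.5)
```

Requiring consecutive deviations is one simple way to trade detection delay against false alarms caused by noisy measurements.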

Most real-life systems are equipped with a limited number
of sensors to track system behavior, and analytic redundancy
methods have to be applied to derive non-local interactions
between potential faults and observations. These
techniques have been applied to a variety of schemes used
in the diagnosis of discrete [de Kleer and Williams, 1987],
discrete event [Lunze, 1999; Sampath et al., 1996], and
continuous systems [Gertler, 1997; Mosterman and Biswas,
1999]. The traditional approach to hybrid system diagnosis
has been to use a single continuous model with complex
non-linearities, or to abstract the continuous dynamics to a
discrete event model. Complex non-linearities complicate
the analysis and may introduce numerical convergence
problems. Discrete event abstractions lead to loss of critical
information, such as fault transient characteristics. Further,
identifying the set of events that describe both
nominal and faulty behavior is often a computationally challenging
task, bringing into question the scalability of such approaches.
Hybrid system analyses require the use of multiple
models of the system. Recent approaches to hybrid system
diagnosis have incorporated appropriate model selection
and mode estimation techniques at run time to track faulty
behavior and perform fault isolation [McIlraith et al., 2000;
Hofbaur and Williams, 2002; Narasimhan and Biswas, 2001;
2002].

Model-based diagnosis techniques can only be as good as
the models upon which they are based. Incomplete and incorrect
models cause problems with the tracking and fault isolation
tasks. The tracking process may produce false alarms or,
worse, missed alarms. In the first case, diagnosis is triggered
when there is no fault in the system. In the second situation,
diagnosis is not triggered and a fault may be missed. Fault
isolation with incomplete and inaccurate models may also
produce false candidates and miss true candidates. On the
other  hand,  building  models  that  include  unnecessary  detail 
may  increase  computational  complexity  making  online  pro¬ 
cessing  infeasible.  Therefore,  a  key  task  in  model-based  di¬ 
agnosis  is  to  build  accurate  models  at  the  right  level  of  detail. 
This  paper  focuses  on  the  pragmatics  of  model  building  and 
fault  isolation  by  performing  a  case  study  on  the  fuel  transfer 
system  of  an  aircraft. 

2  Fuel  System  Description 

High-performance  aircraft  require  sophisticated  control  tech¬ 
niques  to  support  all  aspects  of  operation,  such  as  flight  con¬ 
trol,  mission  management,  and  environmental  control.  An 
aircraft’s fuel transfer system maintains the required flow
of fuel to the engines through different modes of operation,
while ensuring that no imbalances are created that affect the center
of gravity of the system. Fig. 1 illustrates a typical fuel
system configuration. The fuel system geometry is symmetric
and may be split into left side and right side arrangements.
The  overall  system  can  be  divided  into  two  main  sub-systems: 
(i)  the  engine  feed  system,  and  (ii)  the  transfer  system.  The 
feed  system  consists  of  a  left  and  right  engine  feed  tank.  The 
tanks  are  connected  through  pipes  with  controlled  valves  so 
that  fuel  can  be  transferred  between  the  tanks  if  a  fault  occurs 
in  one  of  the  tanks.  A  boost  pump  in  each  of  the  feed  tanks 
controls  the  supply  of  fuel  from  the  tank  to  its  respective  en¬ 
gine.  The  transfer  system  moves  fuel  from  the  two  forward 
fuselage  and  the  two  wing  tanks  to  the  engine  feed  tanks.  The 
intent  is  to  keep  the  engine  feed  tanks  near  full  at  all  times  so 
that  sufficient  fuel  is  available  on  demand,  and  if  failures  oc¬ 
cur  in  the  transfer  system  there  is  still  a  significant  amount 
of  fuel  available  for  emergency  maneuvers.  The  fuel  trans¬ 
fer  sequence  is  set  up  in  a  way  that  maintains  the  aircraft’s 
center of gravity. To achieve this, pumps located in the fuselage
and wing tanks are turned on in a pre-determined
sequence to transfer their fuel to a common transfer manifold
(set  of  tubes).  The  fuel  exits  the  transfer  manifold  through 
level  control  valves  into  the  feed  tanks. 

A  wide  variety  of  sensors  may  be  included  in  the  fuel 
transfer  system.  Fuel  quantity  gauging  sensors  determine  the 
amount  of  fuel  in  a  tank.  Engine  fuel  flow  meters  determine 
engine  fuel  consumption.  Pressure  transducers  measure  the 
transfer and boost pump pressures. Position sensors determine
the open and closed states of valves. Each sensor carries
a penalty that is determined by its weight, reliability, complexity,
and cost. Therefore, designers often try to minimize the
number  of  sensors  while  ensuring  that  the  required  control 
can  be  achieved. 

The transfer system control schedules the pump operation
to match a pre-defined transfer sequence shown in Table 1.
The unit of the amounts in the table is the pound. Initially
one wing pump in each tank is turned on. When a feed tank
quantity decreases by 100 lbs, the level control valve in that
tank will be opened. The fuel then flows from the transfer
manifold into the feed tank, raising its level back to the full
fuel quantity, at which point the level control valve will be
closed, stopping the fuel transfer.

Figure 1: Fuel System Schematic
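
The level control valve rule described above behaves like a hysteresis switch, which can be sketched as follows. The function name and the example feed tank capacity of 1500 lbs are hypothetical; only the 100 lbs drop and the close-when-full behavior come from the text.

```python
def level_control_valve(quantity, full_quantity, valve_open, drop=100.0):
    """Return the next open/closed state of one feed tank's level control valve.

    quantity:      current feed tank quantity (lbs)
    full_quantity: quantity at which the feed tank is considered full (lbs)
    valve_open:    current valve state (True = open)
    drop:          decrease below full that triggers opening (lbs)
    """
    if not valve_open and quantity <= full_quantity - drop:
        return True    # tank has dropped by `drop` lbs: open the valve
    if valve_open and quantity >= full_quantity:
        return False   # tank is full again: close the valve
    return valve_open  # otherwise keep the current state


# Hypothetical feed tank with a full quantity of 1500 lbs.
opens = level_control_valve(1400.0, 1500.0, valve_open=False)
stays_open = level_control_valve(1450.0, 1500.0, valve_open=True)
closes = level_control_valve(1500.0, 1500.0, valve_open=True)
```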


Table 1: Fuel Transfer Sequence (quantities in lbs)

Left Wing Tank   Right Wing Tank   Left Fuselage Tank   Right Fuselage Tank
2500             2500              3300                 3000
2000             2000              3300                 3000
2000             2000              3000                 3000
1000             1000              2000                 2000
0                0                 1000                 1000
0                0                 0                    0

The  most  common  failures  in  this  configuration  are  trans¬ 
fer  and  boost  pump  failures,  and  shutoff  valve  failures.  The 
transfer  pumps  have  two  primary  failure  modes.  One  is  a  to¬ 
tal  loss  of  pressure  caused  by  the  impeller  not  turning.  The 
other  is  a  degraded  state  caused  by  mechanical  wear,  leakage, 
or  electrical  failure  where  the  fuel  flow  rate  falls  below  nom¬ 
inal  values.  The  second  failure  can  lead  to  the  first  condition 
over  time.  Faults  in  the  boost  pump  mirror  those  in  the  trans¬ 
fer  pumps.  Valve  failures  are  stuck-at  conditions,  i.e.,  their 
positions  do  not  change  even  when  they  are  commanded  to 
do  so.  This  can  result  from  mechanical  friction/jamming  of 
the  shaft  or  electrical  failure  of  the  motor  or  power  source.  In 
this  work,  we  also  consider  partial  failures  of  the  valves.  A 
third  class  of  faults  that  we  consider  is  leaks  in  the  connecting 
pipes.  Our  goal  is  to  develop  an  online  diagnostic  system  for 
detection,  isolation  and  identification  of  these  faults. 


3  Component-based  Hierarchical  Modeling 
for  Diagnosis 

Complex real-world systems are made up of a number
of subsystems and components. Hierarchical component-based
approaches, e.g., Statecharts [Harel, 1987], 20sim [van
Amerongen, 2000], and Ptolemy [Buck et al., 1994], are a
practical way of constructing models of such systems.
We have developed a new methodology for hierarchical
component-based modeling that customizes the graphical Generic
Modeling Environment (GME) with a hybrid bond graph
(HBG) approach for building hybrid models of physical systems
with supervisory controllers. This section reviews our
approach to hybrid bond graph modeling, then presents the
GME methodology for building component-based models for
the aircraft fuel transfer system.

3.1  Hybrid  Bond  Graphs 

Our  approach  to  modeling  the  fuel  system  is  based  on  an  ex¬ 
tended  form  of  bond  graphs  [Karnopp  et  al.,  1990],  called 
Hybrid  Bond  Graphs  (HBG)  [Mosterman  and  Biswas,  1998]. 
Bond graphs present a methodology for energy-based modeling
of physical systems. Generic bond graph components
represent primitive processes, such as the energy storage
elements (inertias and capacitors) and the dissipative elements
(resistors). Bonds represent the energy transfer pathways in the
system. Junctions, which are of two types, 1 (series) and
0 (parallel), define the component interconnectivity structure
and impose energy conservation laws. Overall, the bond
graph  topology  implies  system  behavior  that  combines  indi¬ 
vidual  component  behaviors  based  on  the  principles  of  conti¬ 
nuity  and  conservation  of  energy. 

Extensions  to  hybrid  systems  require  the  introduction  of 
discrete  changes  in  the  model  configuration.  In  the  HBG 
framework,  discontinuities  in  behavior  are  dealt  with  at 
a  meta  level,  where  the  energy  model  embodied  in  the 
bond  graph  scheme  is  suspended  in  time,  and  discontinuous 
model  configuration  changes  are  modeled  to  occur  instanta¬ 
neously.  Therefore,  the  meta  level  describes  a  control  struc¬ 
ture  that  causes  changes  in  bond  graph  topology  using  ide¬ 
alized  switches  that  do  not  violate  the  principles  of  energy 
distribution  in  the  system.  Topology  changes  result  in  a  new 
model  configuration  that  defines  future  behavior  evolution. 
To  ensure  physical  principles  are  not  violated,  we  have  de¬ 
veloped  transformations  that  derive  the  initial  system  state  in 
the  new  configuration  from  the  old  one.  From  this  point  on 
behavior  evolution  is  continuous,  till  another  discrete  change 
is  triggered  at  the  meta  level. 

To keep the overall behavior generation consistent, the
meta-model control mechanism and the energy-related bond
graph models are kept distinct. The switching structure is
implemented as localized switched junctions that provide a compact
representation of the system model across all its nominal
modes of operation. Instead of pre-enumerating the bond
graph for each mode, the HBG uses individual junctions to
model local mode transitions. The switched 0- and 1-junctions
represent idealized discrete switching elements that can
turn the corresponding energy connection on and off. A finite
state machine determines the ON/OFF physical state of the
junctions. The transitions in this automaton depend on both
control signals and internal variable values.

Figure 2: Hybrid Bond Graph Example

Fig. 2 shows the hybrid bond graph model of a portion of
the fuel system. The dotted subsystem represents the wing
tank, and the dashed subsystem represents the fuselage tank.
In this simplified model, the tank system is modeled as a capacitor
for storage of fuel, the pump system as an effort source
that boosts the pressure and creates an outflow, and the pipes as
conduits with resistive losses. For this configuration with two
switched junctions, the system can be in four different modes.
When both junctions are off, no fuel is supplied to the
feed tank; when exactly one junction is on, one of the two tanks
(wing or fuselage) supplies fuel to the feed tank; and when both
are on, both tanks supply fuel to the feed tank at the same time.
Switching of configurations is
achieved by changing the switching signal values. Suppose
the wing tank is supplying fuel, i.e., signal_1 = 1 (on) and
signal_2 = 0 (off). To switch supplying tanks, we simply set
signal_1 = 0 (off) and signal_2 = 1 (on). The state equation
model for the new configuration can be easily derived online
using a standard algorithm [Karnopp et al., 1990].
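
The four modes induced by two switched junctions can be enumerated with a short sketch. The names signal_1 (wing tank junction) and signal_2 (fuselage tank junction) and both helper functions are assumptions made for illustration; the point is that n switched junctions yield 2^n modes without pre-enumerating a bond graph for each.

```python
from itertools import product


def enumerate_modes(junction_signals):
    """Return every ON/OFF assignment of the given switched junctions."""
    return [
        dict(zip(junction_signals, states))
        for states in product((0, 1), repeat=len(junction_signals))
    ]


def supplying_tanks(mode):
    """Map a mode to the tanks that supply the feed tank in that mode."""
    tanks = []
    if mode["signal_1"]:
        tanks.append("wing")
    if mode["signal_2"]:
        tanks.append("fuselage")
    return tanks


modes = enumerate_modes(["signal_1", "signal_2"])
# Switching supplying tanks: from wing only to fuselage only.
before = supplying_tanks({"signal_1": 1, "signal_2": 0})
after = supplying_tanks({"signal_1": 0, "signal_2": 1})
```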

3.2  GME 

We have developed an approach for building component-based
system models using a graphical modeling tool called the
Generic Modeling Environment (GME) [Ledeczi et al., 2001].
GME is a configurable toolkit for creating domain-specific
modeling and program synthesis environments. The configuration
is accomplished through meta-models¹ specifying
the modeling paradigm (modeling language) of the application
domain. The modeling paradigm contains the syntactic,
semantic, and visual presentation information of the domain,
such as the concepts that form the building blocks for
constructing models, the relationships among these concepts,
how the concepts may be organized and viewed by the modeler,
and the rules governing the composition of individual
concepts and sets of concepts to form component and system
models. The modeling paradigm defines the family of models
that can be created using the resultant modeling environment.

¹ The concept of meta-models in GME differs from the meta level
switching models in HBG.

The meta-models specifying the modeling paradigm are
employed to automatically generate the target domain
modeling environment, e.g., the HBG environment. The generated
modeling environment is then used to build domain
models that are stored in a model database. The primarily
graphical domain models can be conveniently stored in standard
formats, including XML, to be used by specific applications.
We have developed a GME modeling paradigm that
describes the HBG modeling environment.

3.3  Hierarchical  and  Compositional  Modeling  in 
the  fluid  domain 

Real  life  systems  with  embedded  control  tend  to  be  complex, 
and  system  designers  and  engineers  typically  have  a  lot  of 
difficulty  in  generating  flat  models  of  the  entire  system.  Hi¬ 
erarchical  and  compositional  modeling  are  useful  tools  that 
allow  the  system  to  be  constructed  in  a  structured  way  by 
modeling  subsystems  independently  and  composing  them  to 
generate  system  models.  The  two  main  steps  in  this  approach 
are:  (i)  decomposition  into  subsystems  and  building  mod¬ 
els  of  subsystems,  and  (ii)  specifying  interactions  between 
subsystems  and  using  composition  operators  to  build  system 
models.  This  approach  provides  a  number  of  advantages, 
such  as  simplicity  in  model  building,  independence  in  build¬ 
ing  subsystem  models,  and  modularity  and  reusability  of  the 
generated  components. 

To  model  the  fuel  transfer  system,  we  develop  models  of 
typical  components  in  the  fluid  domain,  such  as  tanks,  pipes, 
and  pumps.  Pipes  may  include  valves  that  regulate  flow. 
Pumps  and  valves  can  be  turned  on  and  off.  We  assume  that 
their  switching  time  constants  are  much  faster  than  the  time 
constants  in  the  fluid  domain.  Therefore,  pumps  and  pipes 
with  valves  are  modeled  as  hybrid  systems.  In  our  GME 
paradigm,  subsystems  are  modeled  as  components.  Interac¬ 
tions  between  the  components  are  specified  as  relations  be¬ 
tween  their  in-ports  and  out-ports.  Components  connected 
by  ports  define  the  system  model.  The  ports  can  model:  (i) 
energy  transfer  between  components  in  the  bond  graph  frame¬ 
work,  and  (ii)  communication  of  information  by  signals.  Sig¬ 
nals  are  assumed  to  have  no  energy  content. 

For this work, we build first-order linear models². Tanks
are  modeled  as  a  bond  graph  segment  with  a  capacitor  con¬ 
nected  to  a  0  junction.  Each  tank  component  can  have  multi¬ 
ple  in-ports  and  out-ports.  In-ports  have  energy  connections 
(bonds)  to  the  0  junction  and  out-ports  have  bonds  from  the 
0  junction.  Ports  may  be  marked  as  in  and  out  based  on  a 
conventional  direction  for  energy  flow,  but  this  does  not  mean 
that  energy  cannot  flow  in  the  opposite  direction.  In  case  there 
is  an  energy  flow  in  the  opposite  direction,  the  corresponding 
variable  takes  on  a  negative  value. 

Pipes are modeled as resistors connected to a 1 junction.
Pipes have exactly one in-port and one out-port that can be
connected to ports of other tanks and pipes. The switching
on the pipes is achieved by specifying switching signals on
the 1 junction connected to the resistor. As discussed earlier,
pumps are modeled as an effort source connected to a transformer,
which is connected to a 0 junction. Pumps have one
out-port for representing the pressure delivered by the pump.

² In reality the pressure-flow relations are nonlinear.

Figure 3: Hierarchical and Compositional Modeling

As an example, the left wing tank is connected to the left
feed tank by instantiating two tank components, one pipe
component, and one switched pipe component. The switched
pipe controls the flow into the feed tank. The out-port of the
first tank (left wing tank) is connected to the in-port of the
pipe, and the out-port of the pipe is connected to the in-port of the
second (switched) pipe. Since the pump is modeled to pull fuel out of the
left wing tank, we connect the out-port of a pump component
to the in-port of the pipe. Fig. 3 illustrates the component-based
and hierarchical model of this subsystem and the underlying
models of some of the components.
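
The composition in this example can be mimicked with a small data-structure sketch. It is not the GME API: the Component and Model classes, the port names "in" and "out", and the connect method are hypothetical, and the final connection from the switched pipe to the feed tank is inferred from the statement that the switched pipe controls the flow into the feed tank. The sketch enforces the rule, stated in Section 3.4, that connections run from out-ports to in-ports only.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A model component with named in-ports and out-ports."""
    name: str
    in_ports: list = field(default_factory=list)
    out_ports: list = field(default_factory=list)


@dataclass
class Model:
    components: dict = field(default_factory=dict)
    connections: list = field(default_factory=list)

    def add(self, name, in_ports=(), out_ports=()):
        self.components[name] = Component(name, list(in_ports), list(out_ports))

    def connect(self, src, out_port, dst, in_port):
        # Connections may only run from an out-port to an in-port.
        if out_port not in self.components[src].out_ports:
            raise ValueError(f"{src} has no out-port named {out_port!r}")
        if in_port not in self.components[dst].in_ports:
            raise ValueError(f"{dst} has no in-port named {in_port!r}")
        self.connections.append((src, out_port, dst, in_port))


model = Model()
model.add("LeftWingTank", in_ports=["in"], out_ports=["out"])
model.add("LeftFeedTank", in_ports=["in"], out_ports=["out"])
model.add("Pipe", in_ports=["in"], out_ports=["out"])
model.add("SwitchedPipe", in_ports=["in"], out_ports=["out"])
model.add("Pump", out_ports=["out"])  # pumps expose a single out-port

model.connect("LeftWingTank", "out", "Pipe", "in")
model.connect("Pipe", "out", "SwitchedPipe", "in")
model.connect("SwitchedPipe", "out", "LeftFeedTank", "in")
model.connect("Pump", "out", "Pipe", "in")
```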

3.4  Modeling  for  diagnosis 

Models form the core component of the tracking and diagnosis
algorithms [Biswas and Yu, 1993; Narasimhan et al.,
2000]. The hybrid observer uses quantitative state space
models for tracking nominal system behavior; the fault isolation
and identification unit uses temporal causal graphs
(TCG) for qualitative analysis and input-output equation
(IOE) models for quantitative parameter estimation. We have
devised schemes to systematically derive these model representations
from the HBG models created in GME.

Precise tracking of nominal system behavior requires that the
component parameter values in the bond graph model be accurately
estimated. We describe our parameter estimation
methodology in the next section. For fault isolation and identification,
there has to be a one-to-one correspondence between
faults and parameters in the model. For example, if we
abstract  a  pump  model  and  represent  it  as  an  effort  source, 
we  cannot  differentiate  among  faults  in  the  electrical  versus 
mechanical/fluid  part  of  the  pump.  Including  a  transformer 
component  that  models  the  electrical  to  fluid  energy  trans¬ 
formation  at  an  abstract  level  solves  this  problem.  A  partial 
fault  or  degradation  in  the  mechanical  part  of  the  pump  can 
be  attributed  to  a  change  in  the  transformation  parameter. 

Figure 4: Component Model of Fuel System [components shown:
LeftXferTank, RightXferTank, LeftWingTank, RightWingTank,
TransferManifold, LeftEngineFeedTank, RightEngineFeedTank,
LeftEngine, RightEngine, BleedResistor, and Pipe, SwitchedPipe,
and Pump instances]

Once the model structure has been specified and all parameters
have been estimated, the hybrid bond graph model of
the complete system is derived by composing the component
models and flattening out the hierarchy. Designating
ports as in- and out-ports, and restricting connections to
run from out-ports to in-ports only, ensures the consistency of
bond connections when the components are composed. The
resulting hybrid bond graph may be used to systematically
derive the state space and temporal causal graph models of the
system. In the bond graph framework, each element describes
equations that need to be satisfied for that component. For example,
for a 0 junction the pressures on all incident bonds are
equal and the net flow over all bonds is zero. The procedure
to convert to state space equations may be summarized
as [Karnopp et al., 1990]:

1.  Identify  state  variables  (efforts  on  capacitors  and  flow  on 
inductors). 

2. Identify input variables (effort and flow sources).

3.  Use  constituent  equations  of  the  bond  graph  components 
to  derive  the  relations  between  the  effort  and  flow  vari¬ 
ables  in  the  system. 

4.  Substitute  for  all  non-state  and  non-input  variables  to  de¬ 
rive  the  state  equation  model.  This  step  is  applied  repeat¬ 
edly  till  all  unnecessary  variables  are  eliminated. 

The  algorithm  to  derive  the  TCG  from  the  bond  graph  is  de¬ 
scribed  in  [Mosterman  and  Biswas,  1999]. 
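
As a worked illustration of the four steps, consider a hypothetical single tank (capacitor C) fed by a flow source u and draining through a pipe (resistor R) to ambient pressure. This small system is not taken from the fuel system model, and the parameter values are made up. The tank pressure p is the only state variable and u the only input. The constituent equations C dp/dt = u - q and q = p/R let us eliminate the pipe flow q, which gives dp/dt = -p/(RC) + u/C. A minimal Python sketch of the result, integrated with forward Euler, might look like this:

```python
def tank_state_equation(R, C):
    """Return (a, b) of dp/dt = a*p + b*u for one tank draining through one pipe.

    Derived by substituting the pipe flow q = p/R into C*dp/dt = u - q.
    """
    a = -1.0 / (R * C)
    b = 1.0 / C
    return a, b


def simulate(a, b, p0, u, dt, steps):
    """Integrate dp/dt = a*p + b*u with forward Euler and return the trajectory."""
    p = p0
    trajectory = [p]
    for _ in range(steps):
        p = p + dt * (a * p + b * u)
        trajectory.append(p)
    return trajectory


# Hypothetical parameter values, chosen only for illustration.
a, b = tank_state_equation(R=2.0, C=5.0)
pressures = simulate(a, b, p0=0.0, u=1.0, dt=0.1, steps=2000)
```

With a constant inflow the pressure settles at R*u, the value at which the outflow through the pipe balances the inflow.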

3.5  Building  Models  of  the  Fuel  System 

From  our  discussion  in  earlier  sections,  the  primary  model 
building  steps  are:  (i)  identify  subsystems  and  model  them  at 
the  appropriate  level  of  detail,  (ii)  compose  system  models  by 
specifying  interactions  among  the  subsystems,  and  (iii)  esti¬ 
mate  parameters  of  the  model. 

As  discussed  earlier,  tanks,  pipes,  and  pumps  are  the  main 
components  of  the  fuel  system  model.  In  addition,  we  need 
to  build  models  for  the  transfer  manifold  and  the  engines.  For 
the scenarios we deal with, it was sufficient to model the engines
as constant flow sources that act as sinks. Engines have one
in-port  that  represents  the  flow  into  the  engine  from  the  feed 
tank.  The  transfer  manifold  is  modeled  as  a  single  capacitor 
connected  to  the  0  junction.  The  transfer  manifold  has  mul¬ 
tiple  in-ports  representing  flow  into,  and  multiple  out-ports 
representing  flow  out  of  the  transfer  manifold. 


The  next  step  is  to  determine  the  complete  system  config¬ 
uration.  For  the  fuel  system  we  instantiate  6  tanks:  2  wing 
tanks,  2  fuselage  tanks  and  2  engine  feed  tanks.  Each  has  a 
corresponding pump. Since the outlets of the wing and fuselage
tanks and the inlets of the feed tanks all have valves, we created
a switched pipe component for each of these connections.
Two  instances  of  the  engine  are  created,  and  the  transfer  man¬ 
ifold  component  is  also  included  in  the  model.  Fig.  4  depicts 
our  component-based  GME  model  of  the  entire  fuel  system. 

The  individual  components  are  connected  using  bond 
graph  junctions  to  build  the  energy  model  of  the  overall  sys¬ 
tem.  The  fuselage  tanks  supply  the  transfer  manifold,  where 
the  flows  from  the  fuselage  tanks  sum  up.  This  is  modeled  by 
connecting  one  out-port  of  the  fuselage  tank  to  the  in-port  of  a 
pipe,  and  the  out-port  of  the  pipe  to  the  in-port  of  the  transfer 
manifold. The pump associated with each tank is also connected
to the in-port of the pipe. This develops a high pressure
at the inlet of the pipe, and hence pulls fuel from the tank
into the pipe. The flows from the wing tanks and the transfer
manifold combine and are distributed evenly to the left and right
feed tanks. One out-port of each wing tank is connected to the
in-port of a pipe. The out-ports of these pipes and the out-port
of the transfer manifold are connected to a 0 junction
to combine the flows. The 0 junction is connected to the in-ports
of switched pipes whose out-ports are connected to the
in-ports  of  the  feed  tanks.  In  order  to  maintain  stability  when 
both  feed  tanks  are  closed,  a  bleed  resistor  is  added  to  the 
piping.  This  resistor  bleeds  fuel  into  the  left  feed  tank.  The 
out-ports  of  the  feed  tanks  are  connected  to  the  in-ports  of  the 
corresponding  engines  through  pipes. 
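
The port discipline used in this composition (connections run from out-ports to in-ports only) can be sketched in a few lines of Python. The class names, port names, and component names below are hypothetical and are not the GME model's actual identifiers; the sketch only shows how a connection that violates the port rule could be rejected when the model is assembled.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    in_ports: list = field(default_factory=list)
    out_ports: list = field(default_factory=list)


class Model:
    def __init__(self):
        self.components = {}
        # Each connection is (source component, out-port, target component, in-port).
        self.connections = []

    def add(self, component):
        self.components[component.name] = component

    def connect(self, src, out_port, dst, in_port):
        # Only out-port -> in-port connections are accepted.
        if out_port not in self.components[src].out_ports:
            raise ValueError(f"{src} has no out-port {out_port!r}")
        if in_port not in self.components[dst].in_ports:
            raise ValueError(f"{dst} has no in-port {in_port!r}")
        self.connections.append((src, out_port, dst, in_port))


# Hypothetical fragment: one fuselage tank and its pump feeding the manifold.
model = Model()
model.add(Component("LeftFuselageTank", out_ports=["Out"]))
model.add(Component("LeftFuselagePump", out_ports=["Out"]))
model.add(Component("LeftFuselagePipe", in_ports=["In"], out_ports=["Out"]))
model.add(Component("TransferManifold", in_ports=["In1", "In2"], out_ports=["Out"]))

model.connect("LeftFuselageTank", "Out", "LeftFuselagePipe", "In")
model.connect("LeftFuselagePump", "Out", "LeftFuselagePipe", "In")
model.connect("LeftFuselagePipe", "Out", "TransferManifold", "In1")
```

Attempting to connect an in-port as a source, for example the manifold's "In1" to the pipe's "In", raises a ValueError instead of silently producing an inconsistent bond.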

The next step is to estimate the model parameter values.
For the scenarios we model, the engine fuel consumption rate
is set at $g$ gpm for both engines3. All other parameters are estimated
from experimental data of an entire fuel burn curve,
where all the fuel from the wing and fuselage tanks was consumed
by the engines. We used the rate of depletion of fuel
in the tanks and the flow out of the tanks when the level control
valves are closed to calculate the individual tank capacitances.
For the left feed tank the fuel depletion rate is approximately
$d$ lbs/s, and hence we determine the capacitance of the
left feed tank to be $c_l$. Similarly, the right feed tank capacitance
is estimated to be $c_r$. We performed similar
calculations to determine the wing and fuselage tank capacitances
(approximately $c_w$). To estimate each resistance,
we used the pressure drop across and the flow through the
corresponding pipe. The pump effort and efficiency values
were given nominal, realistic values.

3In this discussion the actual numbers are not used to avoid any
concerns about releasing sensitive data.
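
The estimates described above follow from the linear constituent relations of the capacitor and resistor elements. As a sketch, assuming a linear tank capacitance $C$ and a linear pipe resistance $R$ (the symbols $q$ for volumetric flow, $p$ for tank pressure, and $\Delta p$ for the pressure drop along a pipe are introduced here for illustration only):

```latex
|q| = C \left|\frac{dp}{dt}\right|
\;\Rightarrow\;
C \approx \frac{|q|}{\left|\Delta p_{\mathrm{tank}} / \Delta t\right|},
\qquad
\Delta p = R\, q
\;\Rightarrow\;
R \approx \frac{\Delta p}{q}
```

Here $\Delta p_{\mathrm{tank}} / \Delta t$ is the measured rate of change of tank pressure over an interval in which the level control valves are closed, so that the outflow $q$ is the only flow affecting the tank.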

4  Diagnosis 


Figure  5:  Software  Architecture  for  Diagnosis 

The diagnosis task involves tracking dynamic system behavior
that includes continuous evolution plus discrete mode
changes until the fault detector signals a fault. At this point,
the  fault  isolation  unit  is  invoked.  Discrete  mode  changes 
require  dynamic  switching  of  system  models,  and  may  also 
involve  discontinuous  changes  in  the  system  variables.  The 
fault isolation unit also needs to consider changes in mode
when matching fault signatures with actual system behavior.
This  motivates  the  software  architecture  for  diagnosis,  illus¬ 
trated  in  Fig.  5.  The  input  to  the  diagnosis  system  is  the  model 
as  an  XML  file  and  the  experimental  data  as  a  text  file.  Each 
line  of  the  data  file  represents  one  sample  of  the  data.  Al¬ 
though  the  current  version  uses  a  data  file  as  input,  replacing 
it  with  data  from  an  actual  system  does  not  alter  the  rest  of  the 
architecture.  Each  sample  of  data  includes  all  input  values, 
all  measured  output  values,  and  the  values  of  all  switching 
signals.  The  output  of  the  diagnosis  module  is  the  set  of  diag¬ 
nostic  hypotheses  that  are  consistent  with  the  model  and  data. 
The  diagnosis  output  at  each  time  step  can  be  observed  in  a 
GUI implemented in Python (www.python.org). The active
state  model  (ASM)  is  an  internal  data  structure  that  maintains 
information  about  the  system  including  the  current  mode,  cur¬ 
rent  state  estimates,  and  current  diagnostic  hypotheses.  This 
structure  is  updated  at  each  time  step  from  information  re¬ 
ceived  from  the  observer  and  the  diagnosis  module.  The  hy¬ 
brid  bond  graph  (HBG)  data  structure  contains  the  flattened 
HBG model of the system after composition of all active component
subsystems. The HBG model also contains the switching
conditions for mode changes. These are parsed and stored
in the ASM. All the diagnosis algorithm modules were implemented
in C++. The SWIG toolkit was used for Python-C++
interactions.
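
To make the data file format described above concrete, the following Python sketch parses one line into its three groups of values. The column order (inputs, then measured outputs, then switching signals), the whitespace separator, and the names `Sample` and `parse_sample` are assumptions made for illustration; the source does not specify the file layout.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    inputs: list      # input values for this time step
    outputs: list     # measured output values
    switches: list    # switching signal values (0 or 1)


def parse_sample(line, n_inputs, n_outputs):
    """Split one whitespace-separated line into inputs, outputs and switches.

    Assumed layout: n_inputs floats, n_outputs floats, then integer switches.
    """
    fields = line.split()
    inputs = [float(x) for x in fields[:n_inputs]]
    outputs = [float(x) for x in fields[n_inputs:n_inputs + n_outputs]]
    switches = [int(x) for x in fields[n_inputs + n_outputs:]]
    return Sample(inputs, outputs, switches)


# Hypothetical line: two inputs, two measured pressures, two valve commands.
sample = parse_sample("1.0 1.0 20.5 19.8 1 0", n_inputs=2, n_outputs=2)
```

Because the rest of the architecture only consumes such per-step samples, the same structure could be filled from a live data source instead of a file.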

The  parser  reads  in  the  model  file,  interprets  it  and  cre¬ 
ates  the  HBG  data  structure.  The  symbolic  equation  gener¬ 
ator  (SEG)  takes  the  HBG  and  the  current  mode  of  the  sys¬ 
tem  and  derives  the  state  space  equation  (SSE)  model  of  the 
system,  which  is  stored  in  the  ASM.  When  tracking  system 
behavior,  the  hybrid  observer  reads  in  the  data  sample  for  the 
next  time  step  from  the  data  file,  and  checks  to  see  if  any  con¬ 
trolled  (specified  in  data  file)  or  autonomous  (stored  in  ASM) 
mode  changes  have  occurred.  When  mode  changes  occur, 
the SEG is invoked to recalculate the SSE model. To accommodate
model disturbances and measurement noise,
a  Kalman  filter  is  built  from  the  current  SSE  model  to  track 
system  behavior.  At  each  time  step,  the  fault  detector  com¬ 
pares  the  system  output  with  the  observer  estimates  to  deter¬ 
mine  if  a  fault  has  occurred  in  the  system.  When  the  fault 
detector  triggers,  the  diagnosis  module  is  activated.  The  di¬ 
agnosis  module  uses  propagation  algorithms  on  the  TCG  to 
generate  fault  candidates  that  are  consistent  with  the  observed 
discrepancies. Continued tracking by matching the fault signatures
generated for each candidate hypothesis helps refine
the  candidate  set.  For  details  on  the  hybrid  observer  and  di¬ 
agnosis  algorithms,  refer  to  [Mosterman  and  Biswas,  1999; 
Narasimhan  and  Biswas,  2002]. 

In subsequent sections we briefly describe the hybrid observer
and the diagnosis modules and illustrate their functioning
by applying them to a diagnosis problem on the fuel system.
In the experimental setup, the fuel system is controlled
by the sequence defined in Table 1. The data for the experiments
was generated using a Matlab/Simulink simulation that
was  developed  at  Vanderbilt  University.  We  assume  pressure 
values  are  measured  at  the  output  of  each  of  the  six  tanks  of 
the  fuel  system.  The  fault  introduced  is  an  abrupt  decrease  in 
the  left  fuselage  tank  pump  efficiency  at  time  step  200.  This 
occurs  in  the  mode  when  only  the  left  fuselage  tank  is  sup¬ 
plying  fuel,  and  only  the  left  feed  tank  is  receiving  fuel. 

4.1  Hybrid  Observer  and  Fault  Detector 

The  hybrid  observer  tracks  the  system  behavior  as  it  evolves 
and  the  fault  detector  compares  the  observer  output  to  the  sys¬ 
tem  output  to  determine  if  a  fault  is  present  in  the  system.  The 
hybrid  observer  performs  the  following  tasks: 

•  Continuous  tracking  of  system  behavior  in  current  mode, 

•  Determining  if  a  mode  change  has  occurred,  and 

•  Initializing  the  observer  in  a  new  mode,  with  the  new 
state  and  new  model. 

The discrete-time form of the SSE model is derived to
track  system  behavior.  To  account  for  model  disturbances  and 
noisy  measurements,  we  use  a  Kalman  filter  to  track  system 
behavior.  This  requires  computation  of  the  R  and  Q  matrices 
that model the disturbance and noise variances, and K, which
represents the Kalman gain matrix.
In our experiments, R and Q are diagonal matrices with values
of 0.01 along the diagonal. The Kalman gain (K) is initialized
to a diagonal matrix of arbitrarily high value (100 in
our experiments). This gain matrix typically converges to its
true value in a few time steps.
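
The convergence of the gain can be seen in a scalar textbook Kalman filter. The sketch below is not the authors' implementation: it uses a hypothetical first-order system x[k+1] = a*x[k] + b*u[k], y[k] = c*x[k], treats Q as the disturbance variance and R as the measurement noise variance (a common convention, which the source does not spell out), and places the large initial value of 100 on the error covariance P, which is the standard way of expressing high initial uncertainty.

```python
def kalman_step(x, P, u, y, a, b, c, Q, R):
    """One predict/update cycle of a scalar Kalman filter."""
    # Predict the state and its error covariance one step ahead.
    x_pred = a * x + b * u
    P_pred = a * P * a + Q
    # Compute the gain and correct the prediction with the measurement y.
    K = P_pred * c / (c * P_pred * c + R)
    x_new = x_pred + K * (y - c * x_pred)
    P_new = (1.0 - K * c) * P_pred
    return x_new, P_new, K


# Hypothetical system parameters; Q and R use the 0.01 value from the text.
a, b, c = 0.95, 0.05, 1.0
Q = R = 0.01
x, P = 0.0, 100.0   # large initial covariance (assumption, see lead-in)

gains = []
for y in [0.0] * 20:
    x, P, K = kalman_step(x, P, u=0.0, y=y, a=a, b=b, c=c, Q=Q, R=R)
    gains.append(K)
```

The gain sequence depends only on a, c, Q, R and the initial covariance, not on the measurements, and in this example it settles within a handful of steps.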

Mode changes may be of two types: controlled or autonomous.
Controller-issued switching commands need to
be  provided  in  the  data  file.  These  correspond  to  the  con¬ 
trolled  mode  changes  in  the  system.  At  each  time  step,  the 
observer  checks  to  see  if  the  data  set  indicates  a  mode  change. 
All  autonomous  change  conditions  are  converted  so  that  they 
contain only state and input variables. The observer uses input
data and estimated state values to check, also at each time
step, whether the conditions for any autonomous change
evaluate to true. For the fuel system, there
are  no  autonomous  changes  and  hence  the  data  file  provides 
sufficient  information  to  determine  if  a  mode  change  has  oc¬ 
curred.  If  a  controlled  or  autonomous  mode  change  is  indi¬ 
cated,  the  observer  computes  the  new  mode.  The  equation 
solver  is  invoked  to  derive  the  new  SSE  model.  The  ob¬ 
server  re-initializes  the  state  based  on  the  reset  function  spec¬ 
ified,  and  continues  the  tracking  in  the  new  mode  with  a  new 
Kalman  filter  that  is  derived  from  the  A  and  B  matrices  in  the 
new  mode. 
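
The mode-change check described above can be sketched as follows. Representing a mode as a dictionary of switch states and an autonomous condition as a predicate over the estimated state and the inputs are illustrative assumptions, as are all names in the example.

```python
def next_mode(mode, switch_signals, state, inputs, autonomous_guards):
    """Return the mode implied by the current sample and state estimate.

    mode              -- dict mapping switch name to 0/1 (current mode)
    switch_signals    -- controlled switch values read from the data sample
    autonomous_guards -- dict mapping switch name to predicate(state, inputs)
    """
    new_mode = dict(mode)
    # Controlled changes come directly from the data sample.
    new_mode.update(switch_signals)
    # Autonomous changes are evaluated on estimated state and input values.
    for name, guard in autonomous_guards.items():
        new_mode[name] = 1 if guard(state, inputs) else 0
    return new_mode


# Hypothetical example: the controller opens the right feed valve.
mode = {"left_feed_valve": 1, "right_feed_valve": 0}
new_mode = next_mode(mode, {"right_feed_valve": 1},
                     state=[10.0], inputs=[1.0], autonomous_guards={})
mode_changed = new_mode != mode
```

When `mode_changed` is true, the observer would regenerate the state space model for `new_mode` and restart the filter, as the text describes.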

Fig.  6  illustrates  a  sample  run  of  the  hybrid  observer  for  the 
experimental  setup  described  earlier.  Gaussian  noise  with  a 
2%  noise  variance  was  generated  using  the  Matlab  models  as 
described  earlier.  We  illustrate  the  tracking  of  pressure  in  the 
left  fuselage  tank.  The  thick  line  represents  the  noisy  system 
output  (it  is  more  like  a  waveform  than  a  line  due  to  the  noise 
in  the  measurements)  and  the  thin  line  represents  the  observer 
estimates.  Until  time  step  200  (at  which  point  the  fault  was 
introduced)  we  notice  that  this  line  is  completely  subsumed 
by  the  thick  line  indicating  accurate  tracking.  However,  after 
time step 200 the thin line deviates from the thick line, indicating
a fault. The fault detector, which uses a 5% detection threshold,
flags  the  fault.  In  the  next  section,  we  describe  the  fault  iso¬ 
lation  scheme. 
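
A naive threshold-based detector of the kind used here can be written in a few lines. The sketch assumes that the 5% threshold is applied to the residual relative to the observer estimate; the source gives only the percentage, so this interpretation and the sample values are assumptions.

```python
def detect_fault(measured, estimated, threshold=0.05):
    """Return the index of the first sample whose relative residual
    |y - y_hat| / |y_hat| exceeds the threshold, or None if none does."""
    for k, (y, y_hat) in enumerate(zip(measured, estimated)):
        scale = max(abs(y_hat), 1e-9)   # guard against division by zero
        if abs(y - y_hat) / scale > threshold:
            return k
    return None


# Hypothetical pressure values: the last measurement drops by 10%.
estimated = [10.0, 10.0, 10.0, 10.0]
measured = [10.1, 9.9, 10.2, 9.0]
fault_step = detect_fault(measured, estimated)
```

A detector this simple reacts to any single sample beyond the threshold, which is one reason it tolerates only low noise levels.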

4.2  Fault  Isolation  and  Identification 

Our  fault  isolation  and  identification  methodology,  described 
in  greater  detail  in  [Narasimhan  and  Biswas,  2002],  for  hy¬ 
brid  systems  is  broken  down  into  three  steps: 

1. A fast roll back process using qualitative reasoning techniques
to generate possible fault hypotheses. Since the
fault  could  have  occurred  in  a  mode  earlier  than  the  cur¬ 
rent  mode,  fault  hypotheses  need  to  be  characterized  as 
a  two-tuple  (mode,  fault  parameter),  where  mode  indi¬ 
cates  the  mode  in  which  the  fault  occurs,  and  fault  pa¬ 
rameter  is  the  parameter  of  an  implicated  component 
whose  deviation  possibly  explains  the  observed  discrep¬ 
ancies  in  behavior. 


Figure  6:  Hybrid  Observer  Sample  Run 


2.  A  quick  roll  forward  process  using  progressive  moni¬ 
toring  techniques  to  refine  the  possible  fault  candidates. 
The  goal  is  to  retain  only  those  candidates  whose  fault 
signatures  are  consistent  with  the  current  sequence  of 
measurements. After the occurrence of a fault, the
observer's predictions of autonomous mode transitions
may no longer be correct; therefore, determining the
consistency of fault hypotheses also requires the fault
isolation  unit  to  roll  forward  to  the  correct  current  mode 
of  system  operation. 

3.  A  real-time  parameter  estimation  process  using  quan¬ 
titative  parameter  estimation  schemes.  The  qualitative 
reasoning  schemes  are  inherently  imprecise.  As  a  result 
a  number  of  fault  hypotheses  may  still  be  active  after 
Step 2. We employ least-squares-based estimation techniques
on the input-output form of the system model to
estimate values of the fault parameter that are
consistent with the sequence of measurements made on
the  system. 

The models used in these three steps, the temporal causal graph
(TCG) and the input-output equation (IOE) model, are derived
directly from the hybrid bond graph.
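
As a minimal illustration of the least squares idea in Step 3, the sketch below estimates a single parameter in a static relation y = theta * x. This is a deliberate simplification: the source applies estimation to the input-output form of the dynamic system model, whereas the example fits a hypothetical pipe resistance from made-up flow and pressure-drop samples.

```python
def least_squares_gain(xs, ys):
    """Estimate theta in y = theta * x by minimizing the sum of squared errors.

    Setting the derivative of sum((y - theta*x)^2) to zero gives
    theta = sum(x*y) / sum(x*x).
    """
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / sxx


# Made-up samples of flow through a pipe and the pressure drop across it.
flows = [1.0, 2.0, 3.0, 4.0]
pressure_drops = [2.1, 3.9, 6.2, 7.8]
resistance = least_squares_gain(flows, pressure_drops)
```

Comparing such an estimate with the nominal parameter value indicates both whether the hypothesized parameter has changed and by how much.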

We  illustrate  the  diagnosis  algorithms  for  the  experiment 
discussed in the previous section. As Fig. 6 illustrates, after
time  step  200  the  actual  pressure  in  the  left  fuselage  tank  is 
below  the  predicted  value.  The  fault  detector  flags  this  and 
triggers  the  diagnosis  process.  We  use  the  roll  back  proce¬ 
dure  to  propagate  this  discrepancy  back  through  our  models 
to  generate  the  fault  candidates.  In  the  current  mode,  we  get 
the  following  candidates:  Left  Fuselage  Pump-,  Left  Fuselage 
Pipe+,  Transfer  Manifold+,  Bleed  Resistor+,  Left  Switched 
Pipe+,  Left  Feed  Pump-.  Since  the  left  fuselage  tank  was  not 
open  in  any  of  the  previous  modes,  no  candidates  are  gener¬ 
ated  in  any  previous  modes. 

The  next  step  rolls  forward  to  check  for  the  consistency 
of  the  effects  of  the  faults  hypothesized  against  actual  sys¬ 
tem  measurements.  Since  no  autonomous  mode  changes 
are  present  and  we  assume  that  all  controller  commands  are 
known,  we  know  exactly  what  mode  the  system  is  in.  We 
generate  signatures  (effects  of  fault)  in  that  mode  for  all  the 
above candidates. In the current mode we cannot distinguish
between the candidates because they have similar signatures.
However,  when  a  mode  change  occurs  (right  feed  tank  is  also 
opened),  we  regenerate  signatures  in  the  new  mode  and  note 
that  Left  Switched  Pipe+  and  Left  Feed  Pump-  do  not  affect 
the right feed tank pressure. Since we do observe a discrepancy
in the right feed tank pressure, we can drop these
candidates.  We  cannot  distinguish  between  the  other  candi¬ 
dates  (Left  Fuselage  Pump-,  Left  Fuselage  Pipe+,  Transfer 
Manifold+)  with  the  selected  set  of  measurements.  In  order 
to  distinguish  between  these  candidates  we  need  more  mea¬ 
surements.  For  example,  we  could  model  the  pump  in  more 
detail  and  add  a  sensor  to  measure  the  current  drawn  by  the 
pump  motor.  This  would  let  us  identify  faults  in  the  pump  as 
opposed  to  faults  in  pipes  that  the  pump  is  connected  to. 
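
The pruning step in this example can be sketched as a comparison between predicted and observed qualitative deviations. The signature values below are illustrative assumptions chosen to mirror the narrative (they are not output of the authors' TCG propagation), and the measurement names are hypothetical.

```python
def prune_candidates(signatures, observed):
    """Keep candidates whose predicted deviation matches every observed one.

    Deviations are qualitative: '+' above nominal, '-' below nominal,
    '0' no change. A measurement missing from a signature counts as '0'.
    """
    consistent = []
    for candidate, predicted in signatures.items():
        if all(predicted.get(m, "0") == dev for m, dev in observed.items()):
            consistent.append(candidate)
    return consistent


# Assumed signatures in the mode where the right feed tank is also open.
signatures = {
    "Left Fuselage Pump-": {"left_fuselage_p": "-", "right_feed_p": "-"},
    "Left Fuselage Pipe+": {"left_fuselage_p": "-", "right_feed_p": "-"},
    "Transfer Manifold+":  {"left_fuselage_p": "-", "right_feed_p": "-"},
    "Left Switched Pipe+": {"left_fuselage_p": "-", "right_feed_p": "0"},
    "Left Feed Pump-":     {"left_fuselage_p": "-", "right_feed_p": "0"},
}
observed = {"left_fuselage_p": "-", "right_feed_p": "-"}
remaining = prune_candidates(signatures, observed)
```

The two candidates that predict no effect on the right feed tank pressure are dropped, while the three with identical signatures remain indistinguishable without further measurements.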

Table  2  lists  the  different  fault  classes  in  the  fuel  system. 
Each  fault  class  represents  multiple  instances  of  the  faults 
in  the  same  component.  The  fault  classes  are  numbered  as 
follows:  1)  Wing  Tank  Pump  (WTP),  2)  Wing  Tank  Pipe 
Resistance  (WTR),  3)  Fuselage/Transfer  Tank  Pump  (TTP), 
4)  Fuselage/Transfer  Tank  Pipe  Resistance  (TTR),  5)  Trans¬ 
fer  Manifold  Resistance  (TMR),  6)  Switched  Pipe  Resistance 
(SPR), 7) Feed Tank Pump (FTP), and 8) Feed Tank Pipe Resistance
(FTR). The results of our diagnosis experiments for
these sets of faults appear in the table. The √ mark in row i
and column j indicates that if the roll back process generated
candidates i and j, one of them will be dropped by the roll forward
process. The x mark indicates that the current control
sequence  and  set  of  measurements  are  not  sufficient  to  distin¬ 
guish  between  the  pair  in  question.  In  general,  the  algorithm 
cannot  distinguish  between  pump  and  pipe  faults  associated 
with  the  same  tank.  We  need  more  detailed  models  and  more 
measurements  to  do  this.  We  also  see  that  we  cannot  distin¬ 
guish  between  the  transfer  manifold  and  fuselage  pipe  faults. 
We  can  distinguish  between  all  other  fault  classes.  The  abil¬ 
ity  to  distinguish  between  all  fault  classes  is  critical  since  the 
change  in  control  strategy  depends  on  the  fault  type. 


Table  2:  Fuel  System  Fault  Diagnosability 


5  Conclusions 

We  have  presented  a  case  study  on  modeling  a  real  system 
and  building  a  diagnosis  engine  for  the  system.  The  key 
to  successful  tracking  and  diagnosis  is  to  have  models  that 
are  topologically  correct,  with  parameter  value  estimates  that 
match  the  nominal  observed  behavior  so  as  not  to  generate 
false  alarms.  This  can  be  a  difficult  and  time  consuming  task, 
with a lot of experimental data being required to build accurate
models. The presence of noise in the data complicates
the  tracking  and  fault  detection  task.  For  the  given  set  of 
measurements,  our  tracking  mechanisms  worked  well  with 
fault-free  data  provided  the  variance  of  the  added  Gaussian 
noise  was  limited  to  2%.  Part  of  the  reason  for  such  low 
noise  thresholds  was  the  use  of  a  naive  threshold-based  fault 
detector  in  these  experiments.  The  diagnosis  system  always 
generated  the  true  fault  hypothesis,  but  in  a  number  of  cases 
the  hypothesis  set  contained  more  than  one  fault  candidate. 
This  could  be  attributed  to  lack  of  detail  in  the  models  and  the 
need  for  more  measurements  in  the  analysis.  Also,  parameter 
estimation  was  not  included  as  part  of  the  experiment.  In  pre¬ 
vious  work  [Narasimhan  and  Biswas,  2002],  we  have  shown 
that  parameter  estimation  often  helps  to  isolate  the  true  fault. 

In  future  work,  we  would  like  to  build  more  detailed  mod¬ 
els  of  the  different  components  of  the  fuel  system  in  an  at¬ 
tempt  to  diagnose  a  larger  set  of  faults.  The  experiments  need 
to be extended to run with real data provided by Boeing, as
opposed to simulated Matlab data that we generated at Vanderbilt
University. We would also like to run sensitivity analysis
tests on the diagnosis system's performance under varying
noise  and  disturbance  conditions.  Finally  we  would  like  to 
build  more  robust  techniques  for  fault  detection  and  parame¬ 
ter  estimation  to  combat  the  effects  of  noise  and  disturbance. 
Abstract

We  present  a  distributed  model-based  diagnostics 
architecture  for  embedded  diagnostics.  We  extend 
the  traditional  model-based  definition  of  diagnosis 
to  a  distributed  diagnosis  definition,  in  which  we 
have  a  collection  of  distributed  components  whose 
interconnectivity  is  described  by  a  directed  graph. 
Assuming  that  each  component  can  compute  a  local 
minimal  diagnosis  based  only  on  sensors  internal 
to  that  component  and  knowledge  only  of  its  own 
system  description,  we  describe  an  algorithm  that 
guarantees  a  globally  sound,  complete  and  minimal 
diagnosis  for  the  complete  system.  By  compiling 
diagnoses  for  groups  of  components  based  on  the 
interconnectivity graph, the algorithm efficiently
synthesizes  the  local  diagnoses  computed  in  dis¬ 
tributed  components  into  a  globally-sound  system 
diagnosis  using  a  graph-based  message-passing  ap¬ 
proach. 

1  INTRODUCTION 

This  article  proposes  a  new  technique  for  diagnosing  dis¬ 
tributed  systems  using  a  model-based  approach.  We  assume 
that  we  have  a  system  consisting  of  a  set  of  inter-connected 
components,  each  of  which  computes  a  local  (component)  di¬ 
agnosis.1  We  adopt  the  structure-based  diagnosis  framework 
of  Darwiche  [8]  for  synthesizing  component  diagnoses  into 
globally-sound  diagnoses,  where  we  obtain  the  structure  from 
the  component  connectivity.  Unlike  previous  approaches  that 
compute  diagnoses  using  the  system  observations  and  a  sys¬ 
tem  description  [8;  10],  we  transform  the  component  diagno¬ 
sis  synthesis  into  the  space  of  minimal  diagnoses.  Assum¬ 
ing  that  each  component  can  compute  a  local  minimal  diag¬ 
nosis  based  only  on  sensors  internal  to  that  component  and 
knowledge  only  of  the  component  system  description,  we  de¬ 
scribe  an  algorithm  that  guarantees  a  globally  sound,  com¬ 
plete  and  minimal  diagnosis  for  the  complete  system.  This 

*Research  supported  in  part  by  The  Office  of  Naval  Research 
1Note that one can compute component diagnoses using any
method  which  returns  a  minimal  diagnosis  (with  respect  to  a  speci¬ 
fied  minimality  criterion). 


algorithm uses as input the directed graph (digraph) describing
the connectivity of distributed components, with arc directionality
derived from the causal relation between the
components. Given that real-world graphs of this type are
either  tree-structured  or  can  be  converted  to  tree-structured 
graphs,  we  propose  a  graph-based  message-passing  algorithm 
which  passes  diagnoses  as  messages  and  synthesizes  local  di¬ 
agnoses  into  a  globally  minimal  diagnosis  in  a  two-phase  pro¬ 
cess.  By  compiling  diagnoses  for  collections  of  components 
(as  determined  by  the  graph’s  topology),  we  can  significantly 
improve  the  performance  of  distributed  embedded  systems. 
We  show  how  this  approach  can  be  used  for  the  distributed 
diagnosis  of  systems  with  arbitrary  topologies  by  transform¬ 
ing  such  topologies  into  trees. 

One  important  point  to  stress  is  that  this  approach  synthe¬ 
sizes  diagnoses  computed  locally,  and  places  no  restriction  on 
the  technique  used  to  compute  each  local  diagnosis  (e.g.,  neu¬ 
ral  network,  Bayesian  network,  etc.),  provided  that  each  local 
diagnosis  is  a  least-cost  or  most-likely  diagnosis.  The  syn¬ 
thesis  approach  takes  this  set  of  self-diagnosing  sub-systems, 
together  with  the  connectivity  of  these  sub-systems,  to  com¬ 
pute  globally-consistent  diagnoses. 

The  approach  presented  in  this  article  assumes  that  all 
faults  are  diagnosable  (i.e.,  can  be  isolated)  through  a  central¬ 
ized  algorithm.  We  examine  whether  a  distributed  approach 
can  diagnose  all  faults,  since  a  distributed  algorithm  can  iso¬ 
late  faults  no  better  than  a  centralized  algorithm.  Issues  re¬ 
lating  to  restricted  diagnosability  of  both  centralized  and  dis¬ 
tributed  algorithms  due  to  insufficient  observable  data  (e.g., 
when  the  suite  of  sensors  is  insufficient  to  guarantee  complete 
diagnosability) are examined in [21].

This  article  is  organized  as  follows.  Section  2  introduces 
the  application  model  that  we  use  to  demonstrate  our  ap¬ 
proach.  Section  3  introduces  our  modeling  formalism,  and 
specifies  our  notion  of  centralized  and  distributed  model. 
Section  4  describes  how  we  diagnose  distributed  models. 
Section  5  surveys  some  related  work  on  this  topic.  We  sum¬ 
marize  our  conclusions  in  Section  6. 

2  IN-FLIGHT  ENTERTAINMENT 
EXAMPLE 

Throughout this article we use a simplified example of an
In-Flight Entertainment (IFE) system. Figure 1 shows the
schematic for an IFE system fragment where we have (1) a
transmitter module (Tx) that generates 10 movie channels
(consisting of both video and audio signals) and 10 audio
channels; (2) two area distribution boxes (ADB); and (3) attached
to each ADB$_j$ we have two passenger units, $P_{j1}$ and
$P_{j2}$. For ADB$_j$, passenger $i$, $i = 1, 2$, has a controller $C_{ji}$
for selecting a video or audio channel, plus an audio unit $A_{ji}$
and video display $V_{ji}$. Control signal $C_{ji}$ is sent by passenger
$i$ to ADB$_j$ and then to the transmitter, which in turn sends an
RF signal (RF) to each passenger.

We  adopt  a  notion  of  causal  influence  for  describing  how 
different  components  affect  the  value  of  a  signal  as  it  propa¬ 
gates  through  the  system.  For  example,  the  RF  signal  causally 
influences  the  passenger  audio  and  video  outputs.  In  this 
model the observables are the control signals plus, for passenger
$i$ downstream of ADB$_j$, sound ($S_{ji}$) and video display
($VD_{ji}$). We assign a fault-mode to the transmitter and to each
ADB  and  passenger  unit. 


Figure 1: Schematic of IFE fragment, showing the main modules
and the directed arcs of data-flows.

Our  modeling  approach  makes  the  following  assumptions. 
First,  we  can  specify  a  system  using  an  object-oriented  ap¬ 
proach.  In  other  words,  a  system  can  be  defined  as  a  col¬ 
lection  of  components,  which  are  connected  together,  e.g., 
physically,  as  in  an  HVAC  system,  or  in  terms  of  data  trans¬ 
mission/reception,  as  in  the  IFE  example.  Our  primary  com¬ 
ponent  consists  of  a  block,  which  has  properties:  input  set, 
output  set,  fault-mode,  and  equations.  Given  the  fault-mode 
and  input  set,  the  equations  provide  a  mapping  to  the  output 
set.  In  other  words,  the  inputs  are  the  only  nodes  with  causal 
arcs  into  the  block,  and  the  outputs  are  the  only  nodes  with 
causal arcs out of the block. Typically, we have causal dependence
of block outputs $\omega_i$ on inputs $\xi_i$, i.e., $\omega_i \propto \xi_i$.2

This  distributed  model  consists  of  a  set  of  sub-models,  or 
blocks, which may be connected together. In our IFE example,
the transmitter block has as inputs the control signals $C_1$ and
$C_2$, and as output an RF signal.
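
The block abstraction (input set, output set, fault-mode, and equations) can be sketched as a small data structure. The field names, the string-valued signals, and the transmitter behavior below are hypothetical and serve only to show how the equations map a fault-mode and inputs to outputs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Block:
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    fault_mode: str
    # Maps (fault_mode, input values) to output values.
    equations: Callable[[str, Dict[str, str]], Dict[str, str]]

    def evaluate(self, input_values):
        return self.equations(self.fault_mode, input_values)


def transmitter_equations(fault_mode, inputs):
    # Assumed behavior: a healthy transmitter emits a valid RF signal
    # whenever all of its control inputs are valid.
    ok = fault_mode == "ok" and all(v == "valid" for v in inputs.values())
    return {"RF": "valid" if ok else "invalid"}


tx = Block("Tx", ("C1", "C2"), ("RF",), "ok", transmitter_equations)
rf = tx.evaluate({"C1": "valid", "C2": "valid"})
```

Since a block interacts with the rest of the model only through its input and output sets, blocks can be diagnosed locally and then composed.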

Second, we assume that each component computes diagnoses
based on data local to the component. We do not place
any restrictions on the type of algorithm used to compute the
diagnosis, except that the diagnosis be a least-cost diagnosis.
We will describe the cost function used by our synthesis
algorithm in the following section.

2 The causal function $\propto$ can be generalized to include propositions,
relations, probabilistic functions, qualitative differential equations,
etc. We don't address such a generalization here.

3  MODEL-BASED  DIAGNOSTICS  USING 
CAUSAL  NETWORKS 

This  section  formalizes  our  modeling  and  inference  approach 
to  diagnostics  and  control  reconfiguration.  We  first  introduce 
the  model-based  formalism,  and  then  extend  these  notions  to 
capture  a  distributed  model-based  formalism. 

3.1  FLAT  (CENTRALIZED)  MODELS 

We  adopt  and  extend  the  model-based  representation  for 
diagnosis of Darwiche [8]. We model the system using a
causal  network: 

Definition 1 A system description is a four-tuple $\Phi = (\mathcal{V}, \mathcal{G}, \Sigma)$, where

• $\mathcal{V}$ is a set of variables comprising two variable types:
$A$ is a set of variables (called assumables) representing
the failure modes of the components, $V$ is a set of non-assumable
variables ($V \cap A = \emptyset$) representing system
properties other than failure modes;

• $\mathcal{G}$ is a directed acyclic graph (DAG) called a causal
structure whose nodes are members in $V \cup A$ and whose
directed arcs represent causal relations between pairs of
nodes;

• and $\Sigma$ is a set of propositional sentences (called the domain
axioms) constructed from members in $V \cup A$ based
on the topological structure of $\mathcal{G}$.

This  definition  of  system  description  differs  from  the  stan¬ 
dard  definition  (called  SD  in  [22])  only  in  that  we  include 
a graph $\mathcal{G}$ to complement the domain axioms, the set of failure
modes (commonly called COMPS), and the non-assumable
variables.

The set of non-assumable variables consists of two exclusive
subsets: $V_{obs}$ (the set of observables) and $V_{unobs}$ (the
set of unobservables).

We can capture structural properties of the system description
using the directed acyclic graph, or DAG, $\mathcal{G}$.3 For example,
if an actuator determines if a motor is on or not, we say
that the actuator causally influences the motor. More generally,
$A$ may directly causally influence $B$ if $A$ is a predecessor
of $B$ in $\mathcal{G}$. We use $B \propto A$ to denote the direct causal influence
of the value of $A$ on the value of $B$.4 Through transitivity, we
can deduce indirect causal influence. For example, if $B \propto A$
and $C \propto B$, then $A$ indirectly influences $C$.

We capture the notion of direct causal influence, i.e., a
node $N$ together with those nodes that are directly causally
affected by $N$, using a clan. We define the clan of a node $N$
of a DAG $\mathcal{G}$ in terms of graphical relationships as follows:

3 In other system description specifications, e.g. [12], these
structural relations are captured using logical sentences.

4 This notion of causal influence does not guarantee that A
influences B, but that A may influence B.


Definition 2 (Clan): Given a DAG $\mathcal{G}$, the clan $Y(N_i)$ of a
node $N_i \in \mathcal{G}$ consists of the node $N_i$ together with its children
in $\mathcal{G}$.
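
As a minimal sketch of Definition 2, assume the DAG is stored as a mapping from each node to the list of its children. The `dag` dictionary, the `clan` helper, and the block-level structure shown are illustrative assumptions, not part of the formalism itself.

```python
def clan(dag, node):
    """Return the clan of `node`: the node itself plus its children in the DAG."""
    return {node} | set(dag.get(node, ()))


# Hypothetical block-level structure: a transmitter feeding two ADBs,
# each of which feeds two passenger units.
dag = {
    "Tx": ["ADB1", "ADB2"],
    "ADB1": ["P11", "P12"],
    "ADB2": ["P21", "P22"],
}

adb1_clan = clan(dag, "ADB1")   # ADB1 together with P11 and P12
leaf_clan = clan(dag, "P11")    # a leaf has no children, so only itself
```

The family of a node would instead collect its parents, which requires scanning the whole mapping; the clan only needs the node's own entry.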

We  adopt  the  notion  of  clan  because  we  are  interested  in 
synthesizing  diagnoses  computed  at  a  set  of  distributed  nodes 
organized  in  a  tree  structure.  The  intuition  behind  the  algo¬ 
rithm  is  as  follows:  given  local  diagnoses,  we  start  at  the  par¬ 
ents  of  leaves  in  the  decomposition  tree  and  move  up  the  tree 
to  the  root,  identifying  if  any  node’s  diagnosis  is  affected  by 
the  diagnoses  of  its  children,  and  if  so,  synthesizing  those  di¬ 
agnoses.  To  perform  each  synthesis  operation,  we  use  a  clan. 

A clan is dual to the well-known notion of family, which
is typically defined as a node together with its parents in $\mathcal{G}$.
The clan is important because we need to synthesize local
diagnostics within tree-structured systems, and for such systems
the clan provides a more efficient means of doing so than
the family. For simplicity of notation, we will
denote the clan $Y(N_i)$ of node $N_i$ as $Y_i$.

It  is  also  important  to  define  restrictions  of  subsets  of  ob¬ 
servables: 

Definition 3 (Restriction) We denote by $\theta_i$ the restriction of
an instantiation $\theta$ of variables $V$ to the instantiation of a
subset $V_i$ of $V$. We denote the restriction of variable set $\Gamma$ to
variables in sub-system description $\Phi_i$ by $\Gamma_{\Phi_i}$.
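
The restriction operation can be sketched as a dictionary filter, assuming an instantiation is represented as a mapping from variable names to values. The `restrict` helper is a hypothetical name, and the observable names and values are borrowed from Scenario 2 of Table 1 purely for illustration.

```python
def restrict(instantiation, subset):
    """Restrict an instantiation (variable -> value) to the variables in `subset`."""
    return {var: val for var, val in instantiation.items() if var in subset}


# Observables of two passenger units (values as in Scenario 2 of Table 1).
theta = {"S11": "none", "VD11": "none", "S12": "nom.", "VD12": "none"}

# Restriction of theta to the observables of the first passenger unit.
theta_1 = restrict(theta, {"S11", "VD11"})
```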

One  of  the  key  elements  of  diagnosing  a  system  is  the  in¬ 
stantiation  of  observables,  since  a  diagnosis  is  computed  for 
abnormal  observable  instantiations. 

Definition 4 (Instantiation) $\theta_{\Phi_i}$ is an instantiation of
observables $V_{obs}^{\Phi_i}$ for system description $\Phi_i$. $\Theta_{\Phi_i}$ denotes the
set of all instantiations of observables $V_{obs}^{\Phi_i}$.

We  specify  failure-mode  instantiations  and  partition  the 
possible  states  into  normal  states  and  faulty  states  as  follows: 

Definition 5 (Mode-Instantiation) $\mathcal{A}^*$ is an instantiation of
behavior modes for mode-set $\mathcal{A}$. Further, we decompose
$\mathcal{A}^*$ such that $\mathcal{A}^* = \mathcal{A}^F \cup \mathcal{A}^{\emptyset}$, where $\mathcal{A}^{\emptyset}$ denotes normal
system behaviour, i.e. all modes are normal, and $\mathcal{A}^F$ denotes
a system fault, which may consist of simultaneous faults in
multiple components.

An assumable (behavior-mode variable) specifies the
discrete set of behavior-states that a component can
have, e.g., an AND-gate can be either OK, stuck-at-0,
or stuck-at-1. Our IFE-system, with component-set
{$Tx$, $ADB_1$, $ADB_2$, $P_{11}$, $P_{12}$, $P_{21}$, $P_{22}$}, can have a
mode-instantiation in which all components are OK except $P_{11}$,
which is in audio-fail mode. In this case we have $\mathcal{A}^{\emptyset}$ =
{$Tx$-mode = OK, $ADB_1$-mode = OK, $ADB_2$-mode = OK,
$P_{12}$-mode = OK, $P_{21}$-mode = OK, $P_{22}$-mode = OK}
and $\mathcal{A}^F$ = {$P_{11}$-mode = audio-fail}.
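
The partition of a mode-instantiation into its faulty and normal parts can be sketched as follows, assuming a mode-instantiation is a mapping from component names to mode values and that the normal mode is the string "OK". The `split_modes` helper is a hypothetical name introduced for illustration.

```python
def split_modes(mode_instantiation, normal="OK"):
    """Split a mode-instantiation into (faulty part, normal part)."""
    faulty = {c: m for c, m in mode_instantiation.items() if m != normal}
    nominal = {c: m for c, m in mode_instantiation.items() if m == normal}
    return faulty, nominal


# The IFE mode-instantiation described in the text above.
modes = {
    "Tx": "OK", "ADB1": "OK", "ADB2": "OK",
    "P11": "audio-fail", "P12": "OK", "P21": "OK", "P22": "OK",
}
faulty_part, normal_part = split_modes(modes)
```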

3.2  DISTRIBUTED  SYSTEM  DESCRIPTIONS 

This  section  describes  our  distributed  formalism,  which  ap¬ 
plies  to  collections  of  interconnected  components,  or  blocks. 
We  assume  that  a  distributed  system  description  is  provided 
either  by  the  user  or  is  deduced  from  the  physical  constraints 
of  available  local  diagnostic  agents  and  physical  connectiv¬ 
ity.  For  example,  many  engineering  systems,  such  as  com¬ 
mercial  aircraft,  are  subdivided  into  Line-Replaceable  Units 


(LRUs),  based  on  a  number  of  factors,  such  as  fault-isolation 
capabilities,  physical  constraints,  and  ease  of  repair.  An  LRU 
typically  consists  of  a  number  of  connected  sub-systems,  as 
in  the  Passenger  Unit  of  the  IFE  example,  which  consists  of 
circuit-cards  to  select  audio/video  channels  and  to  drive  the 
audio  and  video  output  devices.  It  is  standard  practice  in 
commercial  aircraft  to  isolate  faults  only  to  the  LRU-level, 
and  replace  faulty  components  only  at  the  LRU-level. 

Definition 6 (Decomposition Function) A decomposition
function is a mapping $\psi(\Phi) = \Phi_{dist}$ that decomposes a
centralized system description $\Phi$ into a distributed system
description $\Phi_{dist} = \{\Phi_1, \ldots, \Phi_m\}$. The distributed system
description induced by a decomposition function $\psi$ is defined
by a decomposition $\Pi$ over the system variables $\mathcal{V}$, i.e. a
collection $\mathcal{X} = \{X_1, \ldots, X_m\}$ of nonempty subsets of $\mathcal{V}$ such
that (1) $\forall i = 1, \ldots, m$, $X_i \in 2^{\mathcal{V}}$; (2) $\mathcal{V} = \bigcup \{X_i \mid X_i \in \Pi\}$.
When $\xi_{ij} = X_i \cap X_j \neq \emptyset$, we call $\xi_{ij}$ the separating set, or
sepset, of variables between $\Phi_i$ and $\Phi_j$.
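
A small sketch of Definition 6, assuming each block is represented by the set of variable names it contains. The helper names and the variable names (`rf`, `tx_mode`, and so on) are hypothetical; the code only checks the covering condition and lists the non-empty sepsets.

```python
from itertools import combinations


def is_decomposition(blocks, variables):
    """Check that the blocks are non-empty subsets that together cover `variables`."""
    universe = set(variables)
    non_empty = all(blocks.values())
    subsets = all(block <= universe for block in blocks.values())
    covers = set().union(*blocks.values()) == universe
    return non_empty and subsets and covers


def sepsets(blocks):
    """Return the non-empty pairwise intersections (separating sets) of the blocks."""
    result = {}
    for (i, xi), (j, xj) in combinations(blocks.items(), 2):
        shared = xi & xj
        if shared:
            result[(i, j)] = shared
    return result


# Hypothetical two-block decomposition sharing one signal variable.
blocks = {
    "Tx": {"tx_mode", "rf"},
    "ADB1": {"rf", "adb1_mode", "c11"},
}
variables = {"tx_mode", "rf", "adb1_mode", "c11"}
shared = sepsets(blocks)
```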

We can describe a distributed system description in terms
of a decomposition graph. A decomposition graph is a graphical
representation of the system model, when viewed as a
collection of connected blocks. In this graph each vertex
corresponds to a block, and each directed edge corresponds to
a directed (causal) link between two blocks. Figure 2 shows
the decomposition graph for the extended IFE example.⁵

A  decomposition  graph  is  a  directed  tree,  or  D-tree,  which 
is  defined  as  follows: 

Definition 7 A D-tree $T_D$ is a directed graph with vertices
$V(T_D)$ with a vertex $r_0$, called the root, with the property
that for every vertex $r \in V(T_D)$ there is a unique directed
walk from $r_0$ to $r$.
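
For a finite graph, the unique-walk property of Definition 7 can be checked without enumerating walks: the root must have no incoming edge, every other vertex must have exactly one, and every vertex must be reachable from the root. The sketch below uses that equivalent test; the function name and the (tail, head) edge representation are assumptions made for illustration.

```python
from collections import Counter, deque


def is_d_tree(vertices, edges, root):
    """Check that every vertex has a unique directed walk from `root`.

    `edges` is a list of (tail, head) pairs, each meaning tail -> head.
    """
    indegree = Counter(head for _, head in edges)
    if indegree[root] != 0:
        return False
    if any(indegree[v] != 1 for v in vertices if v != root):
        return False
    children = {v: [] for v in vertices}
    for tail, head in edges:
        children[tail].append(head)
    # Breadth-first search to confirm that the root reaches every vertex.
    seen, queue = {root}, deque([root])
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen == set(vertices)


vertices = ["Tx", "ADB1", "ADB2", "P11", "P12"]
edges = [("Tx", "ADB1"), ("Tx", "ADB2"), ("ADB1", "P11"), ("ADB1", "P12")]
```

Adding a second incoming edge to any vertex, for example from ADB2 to P11, creates two distinct walks from the root and makes the check fail.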

Definition 8 A decomposition graph $\mathcal{G}_X$ is an edge-labeled
D-tree $\mathcal{G}(\mathcal{X}, \mathcal{E}, \xi)$ with (1) vertices $\mathcal{X} = \{X_1, \ldots, X_m\}$,
where each vertex consists of a collection of variables of $\mathcal{G}$,
(2) directed edges that join pairs of vertices with non-empty
intersections, where arc direction is specified by the causal
direction of the arcs between blocks in the decomposition graph,
i.e., $\mathcal{E} = \{(X_i, X_j) \mid X_i \cap X_j \neq \emptyset, X_j \propto X_i\}$, and (3)
edge labels (or separators) defined by the edge intersections,
$\xi = \{\xi_{ij} = X_i \cap X_j\}$.

We  assume  that  in  a  distributed  system  description,  for  any 
block  all  sensor  data  is  local,  and  all  equations  describing  dis¬ 
tributed  subsystems  refer  to  local  sensor  data  and  local  con¬ 
ditions. 

3.3  DIAGNOSIS  SPECIFICATION 

We  define  the  notion  of  diagnosis  as  follows: 

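Definition 9 (Diagnosis) Given a system description $\Phi$ with
domain axioms $\Sigma$ and an instantiation $\theta$ of $V_{obs}$, a diagnosis
$D(\theta)$ is an instantiation of behavior modes $\mathcal{A}^F \cup \mathcal{A}^{\emptyset}$ such
that $\Sigma \cup \theta \cup \mathcal{A}^F \cup \mathcal{A}^{\emptyset} \not\models \bot$.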
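
To make the consistency condition of Definition 9 concrete, the following brute-force sketch enumerates mode-instantiations and keeps those for which some assignment to the unobserved variables satisfies every axiom. It is an illustration only: the two-component model, the variable names, and the encoding of axioms as Python predicates are all assumptions, and the enumeration is exponential in the number of variables.

```python
from itertools import product


def diagnoses(modes, hidden, axioms, observation):
    """Return every mode-instantiation consistent with the axioms and observation."""
    found = []
    for mode_values in product(*modes.values()):
        delta = dict(zip(modes, mode_values))
        # Consistent if some assignment to the unobserved variables works.
        for hidden_values in product(*hidden.values()):
            world = {**observation, **delta, **dict(zip(hidden, hidden_values))}
            if all(axiom(world) for axiom in axioms):
                found.append(delta)
                break
    return found


# Hypothetical toy model: a transmitter drives an RF signal, and a
# passenger unit reproduces that signal as audio when it is healthy.
modes = {"Tx": ["OK", "fail"], "P11": ["OK", "audio-fail"]}
hidden = {"rf": [True, False]}
axioms = [
    lambda w: w["Tx"] != "OK" or w["rf"],                  # healthy Tx => RF present
    lambda w: w["P11"] != "OK" or w["audio"] == w["rf"],   # healthy P11 => audio follows RF
]
result = diagnoses(modes, hidden, axioms, {"audio": False})
```

With audio absent, the all-OK instantiation is ruled out and the three remaining instantiations are returned as diagnoses.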

5 We do not show the feedback loops of control requests
($C_1$, $C_2$, $C_{11}$, ..., $C_{22}$) since all edges concerning observables can be
cut [7].


Figure  2:  Decomposition  graph  of  extended  IFE  system  de¬ 
scription.  Here  an  oval  corresponds  to  a  vertex,  and  a  block 
corresponds  to  a  sepset.  We  specify  the  variables  associated 
with  each  vertex  in  the  graph. 

This diagnostic framework provides the capability to rank
diagnoses using a likelihood weight $\kappa_i$ assigned to each
assumable $A_i$, $i = 1, \ldots, m$. Using the likelihood algebra
defined in [8], we can compute the likelihood assigned to each
diagnosis for observation $\theta$. We refer to a (diagnosis, weight)
pair using $(D(\theta), \kappa)$. We use the weights to rank diagnoses,
i.e., least-weight diagnoses are the most likely. This provides
a notion of minimal diagnosis, i.e. a diagnosis of weight $\kappa$
such that there exists no lesser-weight diagnosis.
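
The ranking step can be sketched as below under a simplifying assumption: the weight of a diagnosis is taken to be the sum of the weights of its faulty modes. This additive rule is an assumption chosen to match the unit-weight examples later in the text; the likelihood algebra of [8] is not reproduced here, and the helper names are hypothetical.

```python
def weight(diagnosis, fault_weights, normal="OK"):
    """Sum the weights of all non-normal modes in a diagnosis (assumed additive)."""
    return sum(fault_weights[(c, m)] for c, m in diagnosis.items() if m != normal)


def minimal_diagnoses(candidates, fault_weights):
    """Return the (diagnosis, weight) pairs of least weight."""
    if not candidates:
        return []
    weighted = [(d, weight(d, fault_weights)) for d in candidates]
    best = min(w for _, w in weighted)
    return [(d, w) for d, w in weighted if w == best]


# Two candidate explanations for the same observation, all faults of weight 1.
fault_weights = {("P11", "audio-fail"): 1, ("P12", "audio-fail"): 1, ("ADB1", "Xaudio"): 1}
candidates = [
    {"ADB1": "OK", "P11": "audio-fail", "P12": "audio-fail"},
    {"ADB1": "Xaudio", "P11": "OK", "P12": "OK"},
]
best = minimal_diagnoses(candidates, fault_weights)
```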

3.4  LOCAL/GLOBAL  DIAGNOSTICS 

Our  methodology  rests  on  the  determination  of  when  com¬ 
ponent  diagnoses  are  independent,  in  which  case  the  global 
diagnosis  is  just  the  conjunction  of  the  component  diagnoses. 
We  apply  the  decomposition  theorem  of  [8]  to  this  case  of 
distributed  diagnostics: 

Theorem 1 If we have a system description $\Phi$ consisting of
two component system descriptions $\Phi_1$ and $\Phi_2$, and a system
observation $\theta$, and if the variables shared by $\Phi_1$ and $\Phi_2$ all
appear in $\theta$, then

$D^{\Phi}(\theta) = D^{\Phi_1}(\theta_1) \wedge D^{\Phi_2}(\theta_2)$.

This theorem states that a diagnosis is decomposable provided
that the system observation contains the variables
shared between $\Phi_1$ and $\Phi_2$. However, what happens when
the observation $\theta$ does not contain all variables shared
between $\Phi_1$ and $\Phi_2$? One solution [8] is to decompose the
computation of $D(\theta)$ by performing a case-analysis of all shared
variables $\xi_{12}$. However, this case-analysis approach is
exponential in $|\xi_{12}|$, the number of variables on which we do
case-analysis. Hence if we wanted to embed the diagnostics code,
such a case-analysis might be too time-consuming when
performed on a system-level model.
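
The exponential cost of case-analysis is easy to see by enumerating the cases directly. The sketch below assumes each shared variable has a finite domain given as a list; the variable names are hypothetical.

```python
from itertools import product


def shared_variable_cases(domains):
    """Enumerate every joint instantiation of the shared (sepset) variables."""
    names = list(domains)
    return [dict(zip(names, values)) for values in product(*domains.values())]


# Two unobserved binary shared variables give 2 * 2 = 4 cases to analyse.
cases = shared_variable_cases({"rf_audio": [True, False], "rf_video": [True, False]})

# Ten binary shared variables already give 2 ** 10 cases.
many = shared_variable_cases({f"s{i}": [True, False] for i in range(10)})
```

Each case requires its own pair of local diagnosis computations, so the number of cases multiplies the cost of every diagnosis.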

In  the  following  we  assume  that  each  component  computes 
a  local  diagnosis,  i.e.,  a  diagnosis  based  only  on  local  ob¬ 
servables  and  on  equations  containing  only  local  variables.  In 
contrast  a  global  diagnosis  is  one  based  on  global  observables 
and  on  equations  describing  all  system  variables.  Our  task  is 


to  integrate  these  local  component  diagnoses  into  a  globally 
sound,  minimal  and  consistent  diagnosis,  since  for  many  sys¬ 
tems  the  diagnostics  generated  locally  are  either  incomplete 
or  not  minimal. 

Note  that  we  can  obtain  global  diagnostics  for  a  modular 
system  by  composing  local  blocks  and  diagnosing  the  entire 
system  model.  However,  it  is  true  in  many  cases  that  global 
and  local  diagnostics  may  differ.  We  now  define  a  notion  of 
correspondence  between  local  and  global  diagnoses. 

The conjunction of the set of distributed system descriptions
is defined as $D_{dist}(\theta) = \bigwedge_{\Phi_k \in B} D^{\Phi_k}(\theta)$, and we know
that $D_{dist}(\theta) = D(\theta)$ only when $\theta = \bigcup \cdots$

We can compute the diagnoses for this set of distributed
system descriptions either using an on-line algorithm, or by
pre-computing the set of diagnoses for $D_{dist}(\theta)$. In the
following, we outline the compiled method of diagnosis.

We  define  a  table,  called  a  clan  table,  to  specify  local  and 
global  diagnoses  for  collections  of  blocks.  This  table  com¬ 
piles  the  local  case-analysis  required  by  Theorem  1.  We  will 
show  later  how  to  use  this  table  for  our  diagnosis  synthesis 
algorithm. 

Definition 10 A clan (or local/global diagnosis) table for
block-set $B = \{\Phi_1, \ldots, \Phi_k\}$ is a table consisting of tuples
(observable-instantiation, global diagnosis, weight) for all
abnormal instantiations of observables $\theta$ in $B$.
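
One possible in-memory layout for a clan table, offered as a sketch rather than as the representation used by the authors, is a dictionary keyed by the observable instantiation. The helper names are hypothetical; the single entry shown uses the observable values of Scenario 3 in Table 1 with the weight given in Example 2.

```python
def make_key(observation):
    """Turn an observation (variable -> value) into a hashable table key."""
    return frozenset(observation.items())


def add_entry(table, observation, global_diagnosis, weight):
    """Store one (observable-instantiation, global diagnosis, weight) tuple."""
    table[make_key(observation)] = (global_diagnosis, weight)


def lookup(table, observation):
    """Return (global diagnosis, weight), or None for an unlisted observation."""
    return table.get(make_key(observation))


clan_table = {}
scenario_3 = {"C11": "audio", "C12": "audio",
              "S11": "none", "VD11": "none", "S12": "none", "VD12": "none"}
add_entry(clan_table, scenario_3, "Xaudio", 1)
entry = lookup(clan_table, scenario_3)
```

Because the key is built from the full set of (variable, value) pairs, lookup is independent of the order in which observables are reported.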

Note  that  we  can  use  the  compositionality  of  blocks  to 
show  that  any  time  we  compose  a  system  description  from 
multiple  blocks,  we  obtain  “global”  diagnostics  for  that  com¬ 
posed  system  description  when  we  compute  diagnoses  over 
the  composed  system  description.  Hence  the  “global”  diag¬ 
nosis  for  each  collection  of  blocks  is  computed  from  a  system 
description  generated  from  the  composition  of  the  system  de¬ 
scriptions  of  the  blocks  in  B,  using  the  observables  from  B. 

Example 1 Table 1 contrasts the local and global diagnoses
for a set of scenarios where the set $B$ of blocks is an ADB
with downstream passenger units. In these scenarios, we
compute the (probabilistically) most-likely diagnosis, assuming
that all faults are equally likely, i.e., have weight 1. Moreover,
in defining a local diagnosis in Table 1, we report the
conjunction of all local diagnoses, i.e. the local diagnosis is
ADB-diagnosis $\wedge$ $P_1$-diagnosis $\wedge$ $P_2$-diagnosis. In scenarios
1, 2 and 4, the local and global diagnoses are identical. However,
in scenarios 3, 5 and 6, they differ: the passenger units
each assume a local fault, whereas the transmitter unit is the
faulty one (since a single transmitter fault is much more likely
than two simultaneous faults, one in each passenger unit).⁶

Given  this  potential  for  discrepancy  between  local  and 
global  diagnoses,  we  map  the  decomposition  graph  into  a 
representation,  the  clan  graph,  from  which  we  can  synthesize 
globally  sound  and  complete  minimal  diagnoses  from  local 
minimal  diagnoses.  Figure  3  shows  the  clan  graph  for  the 
extended  IFE  example. 

6 These differences arise due to different instantiations of the RF
signal in the local and global diagnosis. We hide the details of the
case-analysis of shared variables for simplicity of presentation.


Scenario  ADB1 Unit      Pass. Unit 11  Pass. Unit 12  Diagnosis
          C11    C12     S11    VD11    S12    VD12    LOCAL                              GLOBAL
1         audio  audio   nom.   none    nom.   none    -                                  -
2         audio  audio   none   none    nom.   none    P11-audio-fail                     P11-audio-fail
3         audio  audio   none   none    none   none    P11-audio-fail ∧ P12-audio-fail    Xaudio
4         video  video   nom.   nom.    nom.   none    P12-video-fail                     P12-video-fail
5         video  video   nom.   none    nom.   none    P11-video-fail ∧ P12-video-fail    Xvideo
6         audio  video   none   none    none   none    P11-audio-fail ∧ P12-video-fail    ADB1-fail

Table 1: Diagnostic Scenarios. We denote a nominal passenger output using nom., and abnormal observable data in
bold-face. Xaudio denotes degraded audio, and Xvideo denotes degraded video.


Definition 11 (Clan graph): A clan graph $\mathcal{G}_Y$ of a DAG
$\mathcal{G}(V, E)$ of vertices $V$ and edges $E$ is an edge-labeled D-tree
$\mathcal{G}(\mathcal{Y}, \mathcal{E}, \xi)$ defined as follows: (1) vertices $\mathcal{Y} = \{Y_1, \ldots\}$,
where each node $Y_i$ consists of a clan of $\mathcal{G}$; (2) edges defined
by non-empty intersections between pairs of vertices,
$\mathcal{E} = \{(Y_i, Y_j) \mid Y_i \cap Y_j \neq \emptyset\}$; and (3) separators defined
by the edge intersections, $\xi = \{\xi_{ij} = Y_i \cap Y_j\}$.
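
A clan graph for a tree-structured DAG can be sketched as follows. Two simplifications are assumptions of this sketch rather than part of Definition 11: only nodes that have children are given a clan, and edges are directed from the clan of a parent to the clan of its child, which for a tree are exactly the pairs of clans that intersect.

```python
def clan_graph(dag):
    """Build (clans, edges, separators) from a tree-structured DAG (node -> children)."""
    clans = {node: {node} | set(kids) for node, kids in dag.items() if kids}
    edges, separators = [], {}
    for parent, members in clans.items():
        for child in dag[parent]:
            if child in clans:
                shared = members & clans[child]
                if shared:
                    edges.append((parent, child))
                    separators[(parent, child)] = shared
    return clans, edges, separators


# Hypothetical block-level structure of the extended IFE example.
dag = {
    "Tx": ["ADB1", "ADB2"],
    "ADB1": ["P11", "P12"],
    "ADB2": ["P21", "P22"],
}
clans, edges, separators = clan_graph(dag)
```

Here the clan rooted at Tx shares ADB1 with the clan rooted at ADB1, so ADB1 is the separator on that edge.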

The  following  section  shows  how  we  use  the  clan  graph  for 
distributed  diagnosis. 

4  DISTRIBUTED  MODEL-BASED 
DIAGNOSIS 

This  section  describes  our  distributed  model-based  diagnosis 
algorithm.  We  first  map  the  directed  graph  of  the  system  into 
a  tree  using  tree-decomposition  techniques,  and  then  employ 
a  message-passing  algorithm  on  the  tree. 

4.1  TREE-DECOMPOSITION 

The work on tree-decomposition stems from work on
treewidth and graph minors [23]. A good review of the literature
can be found in [5]. We define the basic notions below.

Definition 12 A tree decomposition of an undirected graph
$G = (V, E)$ is a pair $(X, T)$ with $T = (I, F)$ a tree, and
$X = \{X_i \mid i \in I\}$ a family of subsets of $V$, one for each
node of $T$, such that

1. $\bigcup_{i \in I} X_i = V$;

2. for all edges $\{v, w\} \in E$ there exists an $i \in I$ with
$v \in X_i$ and $w \in X_i$; and

3. for all $i, j, k \in I$, if $j$ is on the path from $i$ to $k$ in $T$, then
$X_i \cap X_k \subseteq X_j$.

The  last  property  is  known  as  the  running-intersection  prop¬ 
erty  within  the  BN  community.  The  clique-tree  algorithm 


computes  a  tree-decomposition  in  which  each  node  of  the 
tree  is  a  clique,  and  undirected  edges  correspond  to  shared 
variables  between  cliques. 

Given a tree-decomposition, inference complexity is based
on the treewidth, defined as follows. The width of a tree
decomposition is $\max_{i \in I} |X_i| - 1$. The treewidth of a graph $G$
is the minimum width over all tree decompositions of $G$. The
treewidth bears close relations to the maximal vertex degree
and maximal clique of a graph, so it provides a measure of
the complexity of diagnostic inference, among other things.
If a graph has a low treewidth then inference on the graph
is guaranteed to be easy. The task of computing treewidth is
NP-hard [2]. Many algorithms exist that, given a graph with $n$
variables, will compute the treewidth in time polynomial
in $n$ but exponential in the treewidth; see, for example,
[4].
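
The three conditions of a tree decomposition and its width can be checked mechanically. The sketch below assumes the bags are given as a mapping from tree-node identifiers to sets of graph vertices, and that `tree` is an adjacency mapping of a connected undirected tree (the code does not verify that `tree` is in fact a tree). All names are illustrative.

```python
from collections import deque


def tree_path(tree, start, goal):
    """Return the unique path between two nodes of a tree given as an adjacency dict."""
    parents, queue = {start: None}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in tree[node]:
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parents[node]
    return path


def is_tree_decomposition(vertices, edges, bags, tree):
    """Check vertex cover, edge cover and the running-intersection property."""
    covers_vertices = set().union(*bags.values()) == set(vertices)
    covers_edges = all(any({v, w} <= bag for bag in bags.values()) for v, w in edges)
    running_intersection = all(
        bags[i] & bags[k] <= bags[j]
        for i in bags for k in bags for j in tree_path(tree, i, k)
    )
    return covers_vertices and covers_edges and running_intersection


def width(bags):
    """Width of a decomposition: size of the largest bag minus one."""
    return max(len(bag) for bag in bags.values()) - 1


# A path graph a-b-c-d decomposed into three bags arranged in a chain.
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
bags = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
tree = {1: [2], 2: [1, 3], 3: [2]}
```

Swapping the middle and last bags breaks the running-intersection property, because the vertex shared by the two end bags no longer appears in the bag between them.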

Directed  Tree-Decomposition 

The  difference  between  the  standard  literature  on  tree- 
decompositions  and  the  task  addressed  here  is  that  the  stan¬ 
dard  literature  focuses  on  undirected  graphs,  and  we  focus  on 
directed  graphs.  We  capture  and  exploit  the  directionality  of 
causal  relations  during  all  phases  of  diagnostic  inference.  For 
example,  if  we  have  an  abstract  hierarchical  specification  of 
a  system  and  compute  diagnostics  for  each  abstract  hierar¬ 
chical  block,  we  still  preserve  the  directionality  of  causality 
among  the  abstract  blocks.  We  exploit  this  directionality  us¬ 
ing  a  diagnostic  synthesis  algorithm  operating  on  a  directed 
tree. 

Definition 13 A D-tree $T_D$ is a directed graph with vertices
$V_{T_D}$ and a vertex $V_0$, called the root, with the property that
for every vertex $V \in V_{T_D}$ there is a unique directed walk from
$V_0$ to $V$.

The  tree-decomposition  results  have  been  generalized  to 
directed  graphs  in  [16],  and  we  make  use  of  some  of  those 
results  here.  The  key  change  is  that  we  need  to  preserve  or¬ 
dering  of  edges  during  the  decomposition  process.  To  capture 
such  properties,  we  first  need  to  define  a  notion  of  variable  or¬ 
dering,  called  Z-normality. 

Definition 14 Let $\mathcal{G}$ be a digraph and let $Z \subseteq V$. A set $S$ is
$Z$-normal if and only if the vertex-sets of the strong components
of $\mathcal{G} \setminus Z$ can be numbered $S_1, S_2, \ldots, S_d$ such that

1. if $1 \le i < j \le d$, then no edge of $\mathcal{G}$ has a head in $S_i$
and tail in $S_j$, and

2. either $S = \emptyset$ or $S = S_i \cup S_{i+1} \cup \cdots \cup S_j$ for some integers
$i, j$ with $1 \le i \le j \le d$.


Definition 15 A D-tree decomposition of a digraph $\mathcal{G} =
(V, \mathcal{E})$ is a pair $(\mathcal{X}, T_D)$ with $T_D = (I, F)$ a D-tree, and
$\mathcal{X} = \{X_i \mid i \in I\}$ a family of subsets of $V$, one for each
node of $T_D$, where the edges are numbered $J = \{1, \ldots, l\}$ with
$F = \{F_j : j \in J\}$, such that

1. $\bigcup_{i \in I} X_i = V$;

2. for all edges $\{v, w\} \in \mathcal{E}$ there exists an $i \in I$ with
$v \in X_i$ and $w \in X_i$;

3. for all $i, j, k \in I$, if $j$ is on the path from $i$ to $k$ in $T_D$,
then $X_i \cap X_k \subseteq X_j$; and

4. if $j \in J$, then $\bigcup \{X_i : i \in I, i > j\}$ is $X_j$-normal.

The width of a tree decomposition is the least integer $w$ such
that for all $i \in I$, $|X_i \cup \bigcup_j X_j| \le w + 1$, where the union is
taken over all edges $j \in J$ incident with $i$ (compare
$\max_{i \in I} |X_i| - 1$ in the undirected case).
The treewidth of a graph $\mathcal{G}$ is the least integer $w$ such that $\mathcal{G}$
has a D-tree-decomposition of width $w$.

For the class of applications addressed in this article, the
input graphs $\mathcal{G}$ for the system description are digraphs, and
the decomposition graph and clan graph are both D-tree
decompositions of $\mathcal{G}$. For more general digraph topologies, by
applying an algorithm for generating D-tree decompositions,
we can convert the digraphs into a decomposition graph, and
apply the diagnostic synthesis approach. Many of the properties
of undirected tree-decompositions hold for the directed
case [16].

4.2  DIAGNOSIS  OF  SYSTEMS  WITH 
TREE-STRUCTURED  GRAPHS 

We  now  describe  an  approach  to  diagnosing  systems  with 
tree-structured  decomposition  graphs. 

We  assume  that: 

•  We  are  provided  with  the  component  system  descrip¬ 
tions  and  their  connectivity; 

•  There  is  a  single  root  in  the  decomposition  graph  (which 
is  a  component  with  no  parent-components),  and  each 
leaf  is  a  component  with  no  child-component; 

• Nodes have indices starting at the root ($X_1$), increasing
based on a breadth-first expansion from the root and
ending at the leaves, labeled $X_{n-s}, \ldots, X_n$;

•  Each  component  computes  a  local  diagnosis  based  on 
local  observables. 

We  base  our  approach  on  synthesizing  diagnoses,  starting 
from  the  leaf  components  and  ending  up  at  the  root  of  the 
tree.  We  first  decompose  the  decomposition  graph  into  a  clan 
graph.  Based  on  the  clan  graph  we  construct  a  clan  table  for 
each  node  in  the  graph. 

This  algorithm  is  inspired  by  the  Bayesian  network  clique- 
tree  approach  of  [17],  but  replaces  the  clique-tree  with 
an  analogous  clan-tree,  and  passes  diagnoses  as  messages. 
Analogous  to  the  clique-tree  method’s  clique-table  pre- 
computation,  this  approach  requires  pre-computing  clan- 
tables,  but  for  embedded  systems  this  results  in  computation¬ 
ally  simpler  algorithms  than  those  adopted  in  the  past. 

Under this scheme, we pre-compute clan tables for each
clan in $\mathcal{G}_Y$. Given an observation $\theta$ for blocks $X_i, \ldots, X_k$,
where $X_i, \ldots, X_k$ are members of a clan $Y \in \mathcal{G}_Y$, each block
computes diagnostics locally. We then compute the most
likely fault-mode assignment for $Y$ through a process we call
diagnostics synthesis, which entails table-lookup in the clan
table of the minimal diagnosis given $\theta$. The algorithm synthesizes
final diagnoses, going from the leaves to the root. This
guarantees a sound, complete and globally minimum system
diagnosis.

In this approach we first need to pre-compute the clan table,
and then use that table for diagnostic synthesis. We can
pre-compute the clan table from a set of blocks $\{\Phi_1, \ldots, \Phi_k\}$ as
follows:

1. Generate the decomposition graph $\mathcal{G}_X$ from
$\{\Phi_1, \ldots, \Phi_k\}$, with indices increasing in a breadth-first
manner from the root.

2. Generate the clan graph $\mathcal{G}_Y$ of $\mathcal{G}_X$.

3. Compute the clan table for each clan $Y_i$ in $\mathcal{G}_Y$.

Given an observation $\theta$, the diagnostic synthesis algorithm
is as follows:

1. Given observation $\theta$, each block $B_i$ computes its local
diagnosis $D^{\Phi_i}(\theta)$ and likelihood $\kappa(D^{\Phi_i})$.

2. Mark all nodes $i = 1, \ldots, n$ with flag=0;

3. Loop for $j = n$ to 1:

(a) If flag=0 for $X_j$ do:

For each node $X_i$ in the clan $Y(X_j)$, look up the
corresponding clan diagnosis $D^{\Phi_Y}(\theta)$ and weight
$\kappa(D^{\Phi_Y}(\theta))$ in the clan-table;

If $\kappa(D^{\Phi_Y}(\theta)) < \sum_{k : \Phi_k \in Y} \kappa(D^{\Phi_k})$,

• revise fault-mode assignment to nodes in $Y(N_j)$,
by (a) setting the minimum-weight diagnosis
mode-variable; (b) if any local diagnosis $D'$ is
synthesized, update $D'$.

• reassign values to variables in $\Gamma$ based on $D$ and
$\theta$

• if reassignment is sound pass message with fault
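
The bottom-up pass can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each clan table is keyed by the set of currently faulty clan members instead of by raw observations, weights are additive, the flag bookkeeping and the soundness check on reassigned variables are omitted, and all names are hypothetical.

```python
def synthesize(clans, clan_tables, local_weights):
    """Bottom-up diagnosis synthesis over clans listed root-first.

    clans: list of (parent, children) pairs in breadth-first order.
    clan_tables: parent -> {frozenset of faulty members: {block: weight}}.
    local_weights: block -> weight of its local fault (0 means nominal).
    """
    current = dict(local_weights)
    for parent, children in reversed(clans):  # visit the deepest clans first
        members = [parent, *children]
        faulty = frozenset(b for b in members if current[b] > 0)
        entry = clan_tables.get(parent, {}).get(faulty)
        local_total = sum(current[b] for b in members)
        # Replace the local diagnoses only when the clan diagnosis is lighter.
        if entry is not None and sum(entry.values()) < local_total:
            for b in members:
                current[b] = entry.get(b, 0)
    return {b: w for b, w in current.items() if w > 0}


clans = [("Tx", ["ADB1", "ADB2"]), ("ADB1", ["P11", "P12"]), ("ADB2", ["P21", "P22"])]
clan_tables = {
    "ADB1": {frozenset({"P11", "P12"}): {"ADB1": 1}},
    "ADB2": {frozenset({"P21", "P22"}): {"ADB2": 1}},
    "Tx": {frozenset({"ADB1", "ADB2"}): {"Tx": 1}},
}
local = {"Tx": 0, "ADB1": 0, "ADB2": 0, "P11": 1, "P12": 1, "P21": 1, "P22": 1}
final = synthesize(clans, clan_tables, local)
```

With all four passenger units reporting a local fault, the two ADB clans each replace a weight-2 pair with a weight-1 ADB fault, and the root clan then replaces the two ADB faults with a single transmitter fault.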

Theorem 2 Given a tree-structured decomposition graph
$\mathcal{G}_X$ and local component diagnoses, diagnostics synthesis
will compute a sound and globally consistent set of fault
mode assignments for components $X \in \mathcal{G}_X$ within $O(|\mathcal{Y}|)$
message-passing steps, where $\mathcal{G}_Y$ is the clan graph generated
from $\mathcal{G}_X$.

Example 2 Diagnosis Synthesis in a Clan: Consider Scenario
3 of Table 1. For this observation $\theta$, the total set of
possible clan diagnoses is: ($P_{11}$, audio-fail) $\wedge$ ($P_{12}$, audio-fail)
$\vee$ ($ADB_1$, Xaudio). The weights of the diagnoses are 2
and 1, respectively.

In computing diagnoses on a purely local basis, the resulting
diagnosis is ($P_{11}$, audio-fail) $\wedge$ ($P_{12}$, audio-fail), with
weight 2. Note however there is a family diagnosis of weight
1, ($ADB_1$, Xaudio), which is selected since it is of lower
weight than the distributed diagnosis. We now instantiate
each local component with $\theta$, and set diagnoses as follows:
($P_{11}$, $\emptyset$), ($P_{12}$, $\emptyset$), ($ADB_1$, Xaudio). There exists a consistent
set of local variable instantiations for this assignment, so
no further message-passing is necessary.


Figure  4:  Diagnosis  synthesis  procedure.  Step  1:  (a)  local 
diagnoses  synthesized  at  clans,  and  (b)  clan  diagnoses  are 
passed  between  families,  as  noted  by  dark  arrows. 


Example 3 Message-Passing: Figure 4 shows the first stage
of this procedure. In the graph we show nodes where the variables
are restricted to fault mode variables, to simplify the
description of message-passing of instantiations of mode variables.
First, the local diagnoses are computed at each node
in the decomposition graph: all four passenger units register
a fault, and no other nodes in the decomposition graph register
faults. As a shorthand, we denote a fault-weight pair
using variable-names for faults, with $\emptyset$ denoting a nominal
mode. Then, these faults are synthesized at each clan using
the clan-table: fault-weight pair ($P_{11} \wedge P_{12}$, 2) is synthesized
into ($ADB_1$, 1), and fault ($P_{21} \wedge P_{22}$, 2) is synthesized into
($ADB_2$, 1). Second, the synthesized faults ($ADB_1$, 1) and
($ADB_2$, 1) are sent to the adjacent node in the clan graph,
$Y_1$.



Figure  5:  Diagnosis  synthesis  procedure.  Step  2:  global  diag¬ 
noses  computed  following  family  diagnosis  message-passing. 

Figure 5 shows the second stage of this procedure.
Fault-weight pair ($ADB_1 \wedge ADB_2$, 2) is synthesized into ($Tx$, 1)
at clan $Y_1$, and all other fault-modes are set to nominal. This
is the global minimum-weight fault.

4.3  COMPLEXITY  ISSUES 

The complexity of logical resolution within a distributed
framework has been discussed in [1]. Here, our task is
model-based diagnosis within a tree-structured topology.

This approach is based on computing diagnoses for the
clans of $\mathcal{G}$. Hence, it never needs to diagnose a system
description for the entire graph $\mathcal{G}$, but only for the clans of $\mathcal{G}$.
As noted in Theorem 2, once the clan tables are computed,
given any local component diagnoses, the algorithm is linear
in the number of nodes in the clan-graph.


The worst-case complexity of computing a clan table is
exponential in the number of variables in the clan table. The
memory requirements for storing the clan tables are defined
as follows. In the worst case, for a clan with mode variables
$A_1, \ldots, A_m$, where each mode variable has $|\omega_{A_i}|$ faulty
values, a clan table stores an entry for each of the $\prod_i |\omega_{A_i}|$
multiple-fault combinations. For single-fault scenarios, a clan
table must store only $\sum_i |\omega_{A_i}|$ entries.
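
The two storage bounds can be compared numerically. The sketch assumes the multiple-fault count is the product of the per-variable numbers of faulty values and the single-fault count is their sum; the function names and the example counts are hypothetical.

```python
from math import prod


def multiple_fault_entries(faulty_counts):
    """Worst-case number of entries when faults may occur simultaneously."""
    return prod(faulty_counts)


def single_fault_entries(faulty_counts):
    """Number of entries when at most one mode variable is faulty."""
    return sum(faulty_counts)


# A clan with three mode variables, each having two faulty values.
counts = [2, 2, 2]
worst_case = multiple_fault_entries(counts)
single_case = single_fault_entries(counts)
```

The product grows exponentially with the number of mode variables in the clan, while the sum grows only linearly, which is why restricting a clan table to single faults keeps it small.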

The main issue is the time-complexity of generating the
clan tables. For tree-structured systems the complexity of
diagnosing $\mathcal{G}$ is exponential in the clan size, and the complexity
is bounded by the largest clan of $\mathcal{G}$. Hence the complexity of
initially computing diagnoses is the same for the centralized
and distributed approaches. However, for embedded applications,
the distributed approach has a complexity advantage,
since only clan-table lookup and simple message-passing are
required.

5  RELATED  WORK 

Our  approach  to  distributed  diagnosis  has  been  preceded  by 
many  pieces  of  related  work,  and  we  review  several  here. 
Note  that  this  review  examines  the  most  relevant  work,  and 
does  not  claim  to  be  exhaustive. 

One  of  the  most  closely-related  pieces  of  work  describes 
techniques  for  distributed  logical  inference  [1;  20].  This  work 
focuses  on  how  to  perform  logical  reasoning  and  query  an¬ 
swering,  proposing  sound  and  complete  message  passing  al¬ 
gorithms,  by  exploiting  the  tree  structure  of  distributed  theo¬ 
ries.  They  examine  the  complexity  of  computation,  propose 
specialized  algorithms  for  first-order  resolution  and  focused 
consequence  finding,  and  propose  algorithms  for  optimally 
partitioning  a  theory  that  is  not  already  distributed.  In  some 
ways,  our  task  can  be  considered  a  special  case  of  the  general 
problem that Amir and McIlraith examine. Logical inference
computes  a  model,  whereas  diagnostic  inference  computes  a 
minimal  model  in  the  assumables,  a  subset  of  the  language 
of  the  theory.  We  leverage  many  aspects  of  the  specific  diag¬ 
nosis  problem  in  our  work,  aspects  that  serve  to  distinguish 
both  our  approach  and  our  results.  These  include  the  notion 
of  causality,  which  imposes  a  directionality  on  the  tree  struc¬ 
ture  and  the  inference,  and  the  notion  of  preference.  In  ad¬ 
dition,  the  task  of  diagnostic  inference  depends  critically  on 
two  classes  of  distinguished  variables,  assumables  (the  liter¬ 
als  of  interest)  and  observables  (the  inputs),  and  distributed 
diagnosability  depends  on  how  assumables  and  observables 
are  distributed  among  the  collection  of  blocks.  In  addition, 
if  the  variables  common  between  two  blocks  are  observable, 
then  from  a  distributed  diagnostics  point  of  view  those  blocks 
are independent [7].

The  approach  presented  here  bears  some  relation  to  diag¬ 
nostic  approaches  on  trees.  Stumptner  and  Wotawa  [25]  have 
an  algorithm  for  diagnosing  tree-structured  systems.  This  ap¬ 
proach  assumes  a  centralized  system  defined  at  the  compo¬ 
nent  level  whereas  our  approach  deals  with  distributed  sys¬ 
tems  that  can  be  defined  at  any  level  of  abstraction.  In  ad¬ 
dition,  our  assumption  of  sub-systems  computing  their  own 
diagnoses  means  that  our  diagnostic  synthesis  process  is  a 
single-pass algorithm from the leaves of the tree to the root,
whereas Stumptner and Wotawa need a two-pass approach
since  they  must  first  enumerate  all  component  diagnoses.  A 
second  major  tree-based  method  uses  a  clique-tree  decom¬ 
position  of  a  system,  e.g.,  the  diagnostic  method  of  [13].  A 
clique-tree  is  a  representation  that  is  used  for  many  kinds  of 
inference  in  addition  to  diagnosis,  including  probabilistic  in¬ 
ference  and  constraint  satisfaction.  The  tree  we  generate  is  a 
directed  tree  with  a  fixed  root,  and  the  nodes  of  the  tree  are 
generated  based  on  the  clan  property;  a  clique-tree  is  undi¬ 
rected  (with  an  arbitrary  root),  and  the  nodes  of  the  tree  are 
generated  based  on  the  family  property.  One  can  think  of 
the  D-tree  as  a  directed  variant  of  a  clique-tree,  which  is  op¬ 
timized  for  diagnostic  inference.  In  addition,  our  approach 
uses  the  ordering  of  the  D-tree  to  require  message-passing  in 
a  single  direction  only;  in  contrast,  message  propagation  in 
clique  trees  is  bi-directional. 

Our  work  also  bears  some  relation  to  papers  describing  dis¬ 
tributed  solutions  to  Constraint  Satisfaction  Problems  (CSPs) 
[26; 15]. As with the work on distributed logical inference [1],
the task of distributed CSPs is finding a satisfying assignment
to the variables, when constraints are distributed in
a  collection  of  subsets  of  constraints.  Hence  the  underlying 
tasks  of  distributed  diagnosis  and  CSP  satisfiability  are  dif¬ 
ferent.  One  issue  in  this  work  that  is  similar  to  diagnostic 
reasoning  is  the  recording  of  minimal  sets  of  unsatisfiable 
clauses  as  nogoods  [15].  The  computation  of  nogoods  is  a 
key  step  to  computing  diagnoses  [10]. 
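
The link between nogoods and diagnoses can be illustrated with a small sketch. In the standard consistency-based setting without fault models, the minimal diagnoses are the minimal hitting sets of the minimal conflicts (nogoods). The Python fragment below enumerates them by brute force for a made-up set of nogoods; the component names and the function `minimal_hitting_sets` are illustrative assumptions and are not the algorithm of [10] or of this paper.

```python
from itertools import combinations


def minimal_hitting_sets(conflicts):
    """Return all subset-minimal sets that intersect every conflict."""
    conflicts = [frozenset(c) for c in conflicts]
    universe = sorted(set().union(*conflicts))
    found = []
    # Enumerate candidates by increasing size, so any superset of an
    # already-found hitting set can be skipped as non-minimal.
    for size in range(len(universe) + 1):
        for candidate in combinations(universe, size):
            cand = frozenset(candidate)
            if any(h <= cand for h in found):
                continue
            if all(cand & c for c in conflicts):
                found.append(cand)
    return found


# Hypothetical nogoods: each set contains components that cannot all be ok.
nogoods = [{"A1", "M1", "M2"}, {"A1", "A2", "M1", "M3"}]
diagnoses = minimal_hitting_sets(nogoods)
```

The brute-force enumeration is exponential in the number of components and is meant only to make the relationship concrete; practical systems use dedicated hitting-set or ATMS-based procedures.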

There  have  been  several  proposals  for  using  the  ATMS  [9] 
in  a  distributed  manner,  e.g.,  [11;  19;  3;  18].  Our  approach 
differs  from  this  work  in  that  our  approach  uses  system  topol¬ 
ogy  explicitly,  whereas  these  other  approaches  do  not  make 
as  extensive  a  use  of  topology. 

The compilation approach proposed in this article bears
some relation to prior work (see footnote 7). [24] presents an empirical
comparison of centralized compilation techniques as applied to
several  areas,  of  which  diagnosis  is  one.  Our  future  work  in¬ 
cludes  examining  the  applicability  of  these  compilation  tech¬ 
niques  within  our  distributed  framework.  Compilation  is  also 
examined  in  [20],  but  (as  mentioned  earlier)  as  applied  to  a 
different  task,  logical  resolution. 

There  has  been  some  prior  work  on  distributed  model- 
based  diagnosis.  For  example,  the  approach  in  [14]  assumes 
that the diagnosis computed by each distributed agent is globally
correct, and examines the case where agents must cooperate
to diagnose components whose status is unknown. Our
approach  makes  the  more  realistic  assumption  that  diagnoses 
are  not  necessarily  globally  sound,  and  derives  a  very  differ¬ 
ent  global  synthesis  algorithm. 

6  SUMMARY  AND  CONCLUSIONS 

This  document  has  described  a  mechanism  for  computing  dis¬ 
tributed  diagnoses  using  system  topology  and  observability 
properties.  This  algorithm  takes  as  input  minimal  diagnoses 
computed  within  distributed  components,  and  uses  system 
topology  to  integrate  these  diagnoses  into  a  globally  sound 
and  minimal  system  diagnosis. 

7 A review of compilation can be found in [6].


We are in the process of applying this approach to two real-world
domains: In-Flight Entertainment systems and HVAC
systems.

The  approach  presented  here  provides  a  mechanism  for 
designing  systems  with  predictable  distributed  diagnostics 
properties.  A  given  decomposition  graph  can  be  rated  accord¬ 
ing  to  its  diagnosability  and  efficiency.  Additionally,  given  a 
system  description,  we  can  apply  D-tree  decomposition  al¬ 
gorithms  to  the  system  DAG  to  assist  in  identifying  small- 
treewidth  decompositions,  if  any  exist.  Further,  if  a  system 
has no small-treewidth decomposition, one can then recommend
a system re-design to facilitate the efficient computation of
distributed diagnoses.
Abstract

The growing importance of on-board diagnosis for
automobiles demands a close integration of diagnostic
tasks into the entire design process. This report describes
work carried out to date within the European project
"Integrated Design Process for onboard Diagnosis" (IDD).

It presents an analysis of the current design process and the
model of a new process which allows for a better
integration of diagnosis-related tasks, such as
diagnosability analysis, failure-modes-and-effects analysis
(FMEA), and on-board diagnosis design, in the overall design
process of mechatronic subsystems. We then discuss in
what way model-based technology can provide tools to
support the actual integration and, in particular, present an
approach to model-based diagnosability analysis.

Introduction 

The importance of diagnosis in onboard automotive
systems is growing steadily along with the
complexity of the systems. In a modern electronic control
unit (ECU), diagnostic code now accounts on average for
more than 50% of the whole code. At
present, this important role of diagnosis in onboard
systems is not matched by a comparable role
of diagnosis in the design
process chain.

The correct way of dealing with this situation is to reorganize
the design and development chain so that
diagnosis is no longer the last task in the design chain.

This goal provides an opportunity and a challenge to
model-based systems technology for several reasons.
First, in early design stages, when physical prototypes of
the designed system do not yet exist, diagnostic reasoning
can only be based on a model. Second, since the design is
subject to revisions, the adaptation of diagnostics and
fault analysis to such revisions has to happen
automatically or, at least, without major effort. Finally,
the existence and use of (simulation) models for the
development and validation of control design can provide
a basis for the application of model-based diagnosis
technology.

The European Fifth Framework project "Integrated
Design Process for onboard Diagnosis" (IDD) aims
to formalize and standardize the diagnostic design
process, and to enable the introduction of diagnosis early
in the chain. This methodological goal has to be
combined with another important objective: giving the
designers a set of model-based tools that can help them in
evaluating and understanding the effects of each choice
on the system being designed. The IDD project
started in February 2000 with a duration of three years and
involves both industrial and academic partners: Fiat CRF
(Torino), Magneti-Marelli SpA (Torino), PSA Peugeot
Citroën (Paris), Renault (Paris), DaimlerChrysler AG
(Stuttgart), OCC’M Software GmbH (München),
Università di Torino, Université de Paris Nord, XIII, and
Technische Universität München.

Except  for  the  approach  to  diagnosability  analysis,  this 
paper  does  not  aim  at  presenting  new  model-based 
theories  or  techniques,  but  rather  focuses  on  describing 
the  work  and  intermediate  results  of  this  project  in  order 
to  increase  the  awareness  of  this  challenge  in  the  field  of 
model-based  reasoning.  Therefore,  we  start  with  a 
description  of  the  current  design  process  and  its 
deficiencies.  Based  on  this,  a  new  design  process  is 
proposed  in  section  3  that  introduces  the  exchange  of 
models  as  the  major  medium  for  a  closer  interaction 
between  control  design  on  the  one  hand  and  failure- 
modes-and-effects  analysis  (FMEA)  and  diagnostic 
design  on  the  other  hand.  Section  4  outlines  the 
technological  and  software  basis  chosen  by  IDD  to 
develop  the  tools  that  are  required  to  realize  this 
integrated  process.  We  then  present  our  approach  to 
model-based  diagnosability  analysis.  Finally,  we  outline 
the  remaining  work  in  the  project  and  list  the  guiding 
applications  which  will  be  used  in  the  project  for 
validation  of  the  tools. 
Analysis of the Current Process of Design
and Generation of Diagnostics

The current processes of each industrial partner have
been investigated with a focus on the integration of the
diagnostic process and diagnosis-related processes into
the whole design process of mechatronic subsystems.
Starting from these results, a "merged process" has been
developed that is based on the similarities recognized,
ignoring details and small differences. The abstraction of
this process will be used as a comprehensive reference for
the current design processes. This analysis and its
consequences are presented in more detail in [Brignolo et
al. 01].

In the framework presented here we consider especially
processes related to mechatronic subsystems, such as air
conditioning or engine control systems. These subsystems
involve ECUs as centers of control and diagnostic
functions and the physical system, comprising mechanical,
hydraulic, and electrical components. Following
[Bortolazzi-Steinhauer 00], Fig. 1 summarizes the overall design,
isolating the different phases and showing in which way
the process for a subsystem, which is the most interesting
one in this project, is related to the entire process.

[Figure 1 diagram; labels: Entire Process, Subsystems, Hardware Development, Software Development]

Figure 1 Entire Process and subsystem process,
overview

During the 'strategy phase' a first conceptual framework
for the new product is worked out, the 'technology phase'
targets the concept approval, the 'integration phase'
focuses on the realization of the new product by taking
into consideration technical feasibility and manufacturing
aspects, and, finally, the 'production phase' ensures
industrial mass production that meets the quality
requirements.

The  IDD  approach  focuses  primarily  on  the  Technology 
phase  which  leads  to  the  first  almost  complete  prototype, 
but  takes  into  account  that  a  good  amount  of  diagnostic 
development  is  performed  at  present  in  the  Integration 
phase,  as  illustrated  in  Figure  1. 


From  an  abstract  point  of  view,  the  reference  process, 
which  is  focussed  on  the  functional  prototyping  within  the 
technology  phase,  can  be  modeled  as  a  set  of  nested 
loops: 

•  Specifications  loop:  Definition  of  requirements, 
specifications  and  implementation  of  the  validated 
result.  In  this  phase  also  feedback  from  after-sales 
and  customers  may  be  involved.  Further 
requirements  may  be  added  depending  on  mock-up 
observations. 

•  Outer  design  loop:  Design  of  the  whole  system 
prototype,  involving  the  definition  of  the  overall 
structure  of  the  system,  i.e.  the  selection  of  the 
physical  (mechanic,  hydraulic,  electric)  components 
and  decisions  about  the  overall  layout  of  the  system. 
This  loop  terminates  when  the  prototype  meets  all  the 
requirements  and  specifications.  The  core  activities 
are  design  of  the  system  including  its  control  and 
diagnosis,  comprising  a  series  of  inner  design  loops, 
and  the  hardware  development  of  the  physical 
system,  which  runs  in  parallel. 

•  Inner  design  loop:  Design  of  the  ECU-based  control 
system  and  components.  Each  iteration  involves  the 
design  of  the  control  algorithms,  FMEA,  diagnostic 
development,  implementation  of  the  ECU  (HW  and 
SW)  and  verification  of  the  algorithms,  as  shown  in 
Figure  2.  The  verification  step  at  the  end  of  the  first 
iterations  is  performed  using  models  (software/ 
hardware  in  the  loop),  whereas,  later,  the  physical 
system  is  used.  Depending  on  the  achieved  results, 
there  are  several  iterations,  each  one  of  them 
producing  an  advanced  prototype. 


Figure  2  The  reference  process, 
one  iteration  of  the  inner  design  loop 

Three  problem  areas  in  the  reference  design  process  have 
been  identified  as  the  essential  ones  with  respect  to  a 
better  integration  of  the  diagnostic  tasks,  mainly  in  the 
inner  and  the  outer  design  loops. 

The  first  problem  concerns  the  interaction  between  the 
diagnosis  design  process  and  the  FMEA  generation  (cf. 
upper  part  of  Figure  2). 


•  FMEA  and  generation  of  onboard  diagnosis  are 
separated  and  sequential  tasks. 

•  Only a few tools support the information extraction
process needed for the FMEA, e.g. simulating the
consequences of faults or studying interactions
between faults. Thus, a lot of work is left to the
experience and judgment of the people who perform
the FMEA.

The  second  problem  area  concerns  the  interaction 
between  FMEA  and  the  development  of  diagnostics,  and 
the  development  and  design  of  control  algorithms  of  the 
system  (cf.  Figure  2). 

Currently,  these  are  two  substantially  separate  tasks, 
despite  the  fact  that  there  are  important 
interdependencies.  Examples  for  possible  interactions  are: 

•  a change of the control algorithm may turn a physical
component that was not very essential before into a
critical one and hence require additional diagnostics,

•  a  change  of  the  control  algorithm  promotes  the 
masking  of  certain  faults  that  were  detectable  more 
easily  before.  Again,  additional  diagnostics  have  to 
take  this  into  account, 

•  a  change  of  the  diagnostics  aiming  at  enhancing 
diagnosability  may  exploit  additional  signals,  which 
may  possibly  improve  control,  as  well. 

As  a  consequence,  requirements  and  constraints  arising 
from  one  of  these  tasks  can  be  dealt  with  by  the  other 
ones  only  in  the  next  inner  design  loop,  i.e.  changes  in  the 
design  of  control  algorithms  can  have  impact  on  FMEA/ 
diagnosis  only  during  the  next  inner  design  loop  and  vice 
versa, thus causing additional iterations and time delay.

The third problem area concerns the relation between the
design of diagnosis and component selection and layout
definition (cf. left-hand part of Figure 2).

The problem here is that currently the component
selection task is external to the inner design loop. As a
consequence, the choice or placement of sensors, for
instance, is often not optimized with respect to diagnosis
purposes, or, if later changes are made, additional (outer
and inner) design loops are needed that cause delays.

An  improvement  could  be  reached  by  performing  a 
comparative analysis ('what-if analysis') inside the inner
design  step  and  the  integration  in  the  early  phases  of 
control  and  diagnostic  development.  Thus,  part  of  the 
component  selection  task  is  moved  inside  the  inner  design 
process,  and,  in  particular  in  the  early  phases  of  the  inner 
design  loop,  it  is  possible  and  cheap  to  modify  component 
choices,  e.g.  sensors,  regarding  type,  sensitivity  or 
placement  and  to  immediately  explore  the  impact  on 
control  generation,  FMEA,  diagnosability  analysis,  and 
diagnosis  generation. 


The  New  Process 

Based  on  this  analysis  of  the  reference  process  and  the 
outlined  improvements,  we  propose  a  frame  for  a  new 
process  which  is  closely  connected  to  a  new  tool 
architecture. 

In  summary,  the  framework  for  a  new  process  has  to 
satisfy  the  requirement  that  in  the  inner  design  loop  of  the 
process,  the  designers  (the  different  experts  involved  in 
the  design)  should  be  supported  in  performing  different 
activities  in  an  interleaved  way: 

•  design  of  the  physical  system, 

•  design  of  control  algorithms,  and  their  simulation  (for 
quantitative  analysis), 

•  generation of the FMEA of the designed system,

•  analysis  of  the  diagnosability,  i.e.  investigation  which 
faults  are  detectable  and  discriminable  from  each 
other, 

•  derivation  of  on-board  diagnosis  (OBD)  software  for 
the  system, 

•  comparative  analysis  on  the  current  design  (physical 
system  and  control),  i.e.,  analysis  of  the 
consequences  of  applying  changes  to  the  design  both 
from  the  control  and  diagnosability  point  of  view, 

•  comparative analysis of different design alternatives.

Thus, designers and decision makers are supported in the
process of evaluating different designs and in making
choices about the best design of a system.

•  Such a tight integration of different activities and the
aim to perform them concurrently require the fast and
reliable exchange of information about any changes
in the design introduced by any of the activities. This
is why we propose that the model of the system
being designed must play a central role in the new
process, as indicated by Figure 3.

•  The aims to update FMEA, diagnosability analysis
and OBD generation quickly after a change and to
consider different design alternatives in parallel
establish the requirement that these tasks can be
effectively supported or automated by computer tools
based on the model, i.e. they have to be model-based
tools.


Figure  3  Frame  for  the  new  design  process 


Software  Support  for  the  New  Process 

Accordingly,  the  actual  goal  is  to  provide  a  new  set  of 
functions  for  supporting  the  designer,  which  are  realized 
as software 'plug-ins' added to the existing software tools
for  design.  Within  the  scope  of  IDD,  we  are  considering 
three  plug-ins: 

•  tools for diagnosability analysis,

•  tools for supporting the FMEA generation (cf. [Price 98]),

•  tools for supporting the generation of onboard
diagnostics (see e.g. [Bidian et al. 99], [Cascio et al. 99],
[Sachenbacher-Struss-Weber 00]).

These  tools  rely  on  model-based  systems  and  will  be 
based  on  a  common  set  of  models  and  a  common  model- 
based  diagnostic  system  core. 

The  new  process  and  the  respective  tools  should  be 
integrated or combined with the simulation tools that are
currently  used  for  the  design  of  control  strategies  and 
typically  based  on  quantitative  models.  In  IDD,  this  is 
Matlab/Simulink.  This  requires  software  that  transforms 
the  models  created  in  these  environments  into  qualitative 
diagnostic  models  that  form  the  basis  for  the  model-based 
tools. 

Figure  4  summarizes  the  overall  architecture  of  the  new 
design support system.


Figure  4  Tools  architecture  for  the  new  process 

A challenge lies in providing

•  a common software platform with components that
are re-usable in different contexts, and

•  the harmonization of models used for different tasks.

The latter is ideally to be achieved by automated
transformation routines. In particular, the automated
transfer of traditional quantitative models (used e.g. for
simulation and control design) to qualitative models
allowing for automated FMEA and fast, i.e. real-time,
on-board diagnosis is a central target. If indeed successful,
the re-use of existing model fragments for different tasks
will reduce life cycle costs by a significant amount.

IDD  envisions  three  types  of  application  settings: 

•  an  integrated  toolbox  with  its  own  graphical  user 
interface  and  storage  of  models.  A  component- 
oriented  ontology  has  been  chosen  to  best  address 
modeling  requirements  in  the  automotive  domain. 

•  a variety of plug-ins to industry-adopted existing
tools. In IDD, we have chosen Matlab/Simulink.
Models will possibly be stored with these tools, and a
specific graphical user interface will be limited, if
existent at all. The plug-ins provide additional
functionality, namely diagnosability analysis, FMEA,
and the transformation of design information
captured by the Matlab/Simulink model.

•  the (on-board) processing scenario for dedicated
applications such as diagnosis and monitoring. They
are dedicated to a particular variant of a device. A
diagnosis and monitoring application on an ECU is a
typical example.

The IDD toolbox and plug-ins will be running on
Microsoft Windows. Therefore, COM (Component Object
Model) was chosen as a protocol for the interaction of
(binary) components. All the engines, transformers, etc.
are implemented obeying this standard. This allows for
the re-use of functionality in different contexts, and, in
particular, in the three different application settings. The
second cornerstone is given by the use of XML (Extensible
Markup Language) for describing data in a uniform and
exchangeable way. Many of our software components
take XML documents as input and produce such
documents as output.

COM  and  XML  allow  us  to  build  task-related 
applications  that  are  constructed  from  components  which 
themselves  are  aggregated  from  even  more  basic 
components.  The  components  in  the  layer  directly  under 
the  application  level  we  call  engines,  our  third 
cornerstone.  So,  there  are  (re-usable  COM)  components 
that  encapsulate  a  diagnosis  engine,  an  FMEA  engine,  a 
predictive  engine,  a  transformation  engine,  etc.  An 
important  consequence  of  the  choice  of  COM,  XML,  and 
engines  is  that  the  resulting  architecture  is  an  open  one, 
open  at  any  desired  degree  down  to  the  level  of  individual 
methods  of  low  level  objects. 

At  the  component  level,  the  IDD  consortium  has  chosen 
OCC’M’s  Raz’r  [RAZ’R  02]  as  a  basis  for 
implementation.  It  provides  state  of  the  art  model-based 
systems  software  packaged  into  COM-components  and 
supplied  with  XML-interfaces.  This  allows  for  further 
extensions  as  needed  by  the  consortium  requirements. 
These  components  include 

•  an ATMS (Assumption-based Truth Maintenance System)
which provides fast consistency checking and
handling of time. While still adhering to the basic
framework of assumption-based truth maintenance
[de Kleer 86], the employed technology has changed
substantially, making possible the implementation of
on-board systems meeting real-time requirements
([Sachenbacher-Struss-Weber 00]).

•  a constraint-based predictive engine which allows
the computational effort to be limited by specifying
appropriate foci of attention.

•  a model compiler which produces system
descriptions (XML documents) suitable for
processing by various engines. For representing
constraints, a data structure similar to ordered binary
decision diagrams (OBDD), but also suitable for
direct constraint processing, is used as a compact
representation [Bryant 92].

•  a  diagnosis  engine  which  accepts  a  system 
description  and  a  continuous  stream  of  observations 
(measurements)  as  the  input  and  produces  an 
assessment  of  the  current  situation  by  listing  the  best 
candidates  for  diagnosis. 

•  The  model  transformation  engine  is  central  and 
touches  on  still  open  research  questions.  Therefore,  it 
is  a  main  subject  of  the  consortium’s  current 
activities.  As  already  pointed  out,  automated  model 
transformation  is  required  to  obtain  qualitative 
models.  Behavioral  and  structural  descriptions  are 
extracted  from  numerical  models  (developed  in 
Matlab/Simulink),  converted  to  qualitative  models 
represented  in  XML  form  and  possibly  transformed 
into  more  abstract  descriptions  through  a  process 
called  task-dependent  model  abstraction 
([Sachenbacher-Struss  01]).  The  foundations  of  one 
of  the  implementations  and  a  critical  discussion  of 
the  practical  experiences  are  presented  in  [Struss  02]. 

In the following, we discuss the foundations for the
diagnosability analysis engine, which forms a specific
contribution of the project, in a little more detail.

Diagnosability  Analysis  Engine 

Diagnosability  analysis  is  expected  to  answer  two 
different  types  of  questions: 

“For a particular design and a chosen set of sensors,
determine:

•  Fault detectability, i.e. whether and under which
circumstances the possible faults considered can be
detected (by the ECU),

•  Fault (class) discriminability, i.e. whether and
under which circumstances the ECU is able to
distinguish different classes of faults.”

The second question is a generalization of the fault
identification task (“Determine the present fault mode
unambiguously”). This generalization is motivated by
on-board diagnosis requirements: full fault identification is
usually not possible and also not required for on-board
purposes, since there is a limited set of possible recovery
actions that can be performed by the control unit and
which are to be selected dependent on the general type of
fault and its severity rather than the individual fault. For
instance,  only  certain  critical  faults  may  require 
immediate  shut-off  of  the  engine  while  others  allow 
continued  operation  possibly  under  certain  limitations. 
Also  off-board  diagnosis  is  appropriately  characterized  as 
fault  class  discrimination  where  the  classes  comprise  the 
faults  of  the  various  smallest  replaceable  units.  More 
generally,  diagnosis  is  usually  a  discrimination  task 
whose  goal  is  defined  by  the  available  “therapy”  actions. 
Discriminability  is  the  fundamental  task,  because 
detectability  can  be  formulated  as  discriminability  from 
the  normal  behavior. 

Although  the  ultimate  goal  is  to  discriminate  classes  of 
behavior modes from each other, the analysis has to be based
on  the  discriminability  of  each  pair  of  individual  faults 
taken  from  any  pair  of  classes,  which  is  unfortunate  from 
a  computational  point  of  view. 

In our framework, (fault) behavior modes are represented
as finite relations, and discriminability analysis becomes
the task of computing the observable distinctions between
two relations. So, let $V_{obs}$ be the set of observable
variables. In an on-board situation, this corresponds to the
set of actuator and sensor signals. Since we want to
characterize the situations under which detection or
discrimination is possible, we introduce a set of variables
$V_{cause}$ that are exogenous or “causal” variables w.r.t. the
physical system (i.e. the subsystem excluding the ECU).
This set includes the actuator signals but also other
quantities that influence the behavior of the physical
system. Some of the latter may be observables, e.g. the
atmospheric pressure, while others are not (directly)
measurable, such as the load. Since on-board diagnosis
can rely only on what is observable to the ECU, we
define:

$$V_{o\text{-}cause} = V_{obs} \cap V_{cause}$$

and

$$V_{obs \setminus cause} = V_{obs} \setminus V_{cause}$$

as well as the respective projections, $PROJ_{obs}$ and $PROJ_{o\text{-}cause}$.
The  abstract  example  in  Figure  5  will  provide  an  intuition 
about  possible  answers  to  the  discriminability  question. 
The  vertical  axis  represents  the  observable  causal 
variables  and  the  horizontal  axis  the  remaining 
observables.  There  may  be  many  unobservable  variables, 
but  the  shown  projection  to  the  space  of  observables  is  all 
that  matters. 
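
As a rough illustration of these variable sets and projections, the following Python sketch represents a behavior relation as a list of variable assignments and projects it onto a subset of variables. The variable names, the qualitative values, and the helper `project` are assumptions made for this example only; they follow the text in treating the actuator signal and atmospheric pressure as observable causal variables and the load as an unobservable causal variable.

```python
# Hypothetical variable names for a small mechatronic subsystem.
V_obs = {"actuator_cmd", "atm_pressure", "sensor_reading"}
V_cause = {"actuator_cmd", "atm_pressure", "load"}

V_o_cause = V_obs & V_cause          # observable causal variables
V_obs_minus_cause = V_obs - V_cause  # remaining observables


def project(relation, variables):
    """Project a finite relation (iterable of dicts) onto `variables`."""
    keep = sorted(variables)
    return {tuple((v, row[v]) for v in keep) for row in relation}


# A toy behavior relation: each row is one consistent assignment.
model_ok = [
    {"actuator_cmd": "on", "atm_pressure": "normal",
     "load": "low", "sensor_reading": "high"},
    {"actuator_cmd": "on", "atm_pressure": "normal",
     "load": "high", "sensor_reading": "high"},
    {"actuator_cmd": "off", "atm_pressure": "normal",
     "load": "low", "sensor_reading": "low"},
]

proj_obs = project(model_ok, V_obs)          # the load is projected away
proj_o_cause = project(model_ok, V_o_cause)  # observable stimuli only
```

Because the projection returns a set, rows that differ only in unobservable variables (here, the load) collapse into a single observable tuple.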

Two different fault modes (or, more generally, behavior
modes)  are  represented  by  two  relations.  As  illustrated  by 
the  figure,  we  can  distinguish  three  different  cases: 

•  In the upper section the relations cover each other,
i.e. for any causal stimulus in the projection of this
intersection area, the observable sets of consistent
tuples for the two behavior modes are the same, and,
hence, they cannot be discriminated from each
other.

•  In the lower section, they are totally disjoint, i.e. any
of the respective causal inputs always leads to
different system behavior and, thus,
deterministically discriminates between the two
modes.

•  For all other causal inputs, the two modes can
possibly be discriminated, because the actual
response of the system may be outside one of the
relations, but is not guaranteed to be.


Figure 5 Three categories of discriminability of two
behavior modes (vertical axis: observable causal variables)
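
The three cases can be made concrete with a small Python sketch. It assumes that each behavior mode has already been projected onto the observables and is given as a set of (stimulus, response) pairs, where the stimulus stands for a value of the observable causal variables and the response for a value of the remaining observables. The function names and the toy data are hypothetical and serve only to illustrate the classification.

```python
from collections import defaultdict


def responses_by_stimulus(relation):
    """Group observable responses by causal stimulus."""
    grouped = defaultdict(set)
    for stimulus, response in relation:
        grouped[stimulus].add(response)
    return grouped


def classify(mode_a, mode_b):
    """Classify every stimulus consistent with both modes into one of three cases."""
    a, b = responses_by_stimulus(mode_a), responses_by_stimulus(mode_b)
    result = {}
    for stimulus in a.keys() & b.keys():
        if a[stimulus] == b[stimulus]:
            result[stimulus] = "not discriminable"
        elif a[stimulus].isdisjoint(b[stimulus]):
            result[stimulus] = "deterministically discriminable"
        else:
            result[stimulus] = "possibly discriminable"
    return result


# Toy relations: s1 behaves identically, s2 disjointly, s3 overlaps partly.
mode_a = {("s1", "x"), ("s1", "y"), ("s2", "x"), ("s3", "x"), ("s3", "y")}
mode_b = {("s1", "x"), ("s1", "y"), ("s2", "z"), ("s3", "y"), ("s3", "z")}
categories = classify(mode_a, mode_b)
```

Stimuli that are consistent with only one of the two modes are left out of the result in this sketch; how such inputs should be treated depends on the formal definitions that follow.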

With  this  translation  of  the  task  to  the  analysis  of 
relations,  we  can  also  support  our  previous  claim,  that,  in 
general, a pairwise comparison of individual modes is
required to determine the discriminability of classes of
modes.  Consider  the  trivial  example  of  one  inverter  with 
two  mode  classes: 

$C_1 = \{\text{output-stuck-0}, \text{output-stuck-1}\}$,

$C_2 = \{\text{shorted}, \text{ok}\}$.

Figure  6  a  and  b  display  the  four  faults  in  the  observable 
space  i,  o,  grouped  in  the  two  classes. 

Figure 6 Behavior classes of the inverter for fault
classes C1 (a) and C2 (b), shown in the observable space (i, o)


Obviously,  the  faults  are  pairwise  discriminable,  and, 
hence, so are the two classes of faults. However, if we
tried to represent each class as the disjunction of its
modes and associate with it the union of the respective
relations, then both of these class relations would cover the
entire behavior space and would not be distinguishable. The
deeper reason is that a fault class represents more than an
(exclusive) disjunction of modes. We also make a
persistence  assumption,  namely  that  one  particular  mode 
occurs  in  all  inspected  situations  (i.e.  for  all  inputs). 
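
The inverter example can be checked mechanically. The Python sketch below encodes the four modes as relations over (i, o) with binary values, under the assumption that "shorted" means the output equals the input; all identifiers are illustrative. It computes, for each pair of modes taken from the two classes, the inputs that separate them, and then forms the union relation of each class.

```python
INPUTS = (0, 1)

# Behavior relations over (i, o); "shorted" is assumed to mean o == i.
modes = {
    "output-stuck-0": {(i, 0) for i in INPUTS},
    "output-stuck-1": {(i, 1) for i in INPUTS},
    "shorted": {(i, i) for i in INPUTS},
    "ok": {(i, 1 - i) for i in INPUTS},
}
C1 = ["output-stuck-0", "output-stuck-1"]
C2 = ["shorted", "ok"]


def outputs(relation, i):
    """Outputs that the relation allows for input i."""
    return {o for (inp, o) in relation if inp == i}


def discriminating_inputs(m1, m2):
    """Inputs for which the two modes allow disjoint sets of outputs."""
    return [i for i in INPUTS
            if outputs(modes[m1], i).isdisjoint(outputs(modes[m2], i))]


# Every cross-class pair of modes has at least one separating input.
pairwise = {(m1, m2): discriminating_inputs(m1, m2) for m1 in C1 for m2 in C2}

# The class-level unions, in contrast, are identical.
union_C1 = set().union(*(modes[m] for m in C1))
union_C2 = set().union(*(modes[m] for m in C2))
```

Each pair is separated by some input, whereas both unions equal the full space of (i, o) pairs, which is why the class-level union loses the information that the persistence assumption preserves.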

Before  we  give  formal  definitions  and  computable 
expressions  for  the  concepts,  we  introduce  one  last 
element:  operating  conditions.  This  reflects  the  common 
practice  of  distinguishing  between  ranges  of  internal  or 
external  quantities  that  result  in  qualitatively  different 
behaviors  and  are  often  reflected  by  different  states  of  the 
system  and  its  control.  Examples  are  engine  idle,  clutch 
engaged,  cold  engine,  brake  pedal  pushed. 

Often,  the  analysis  of  fault  effects  and  diagnosability  can 
be  restricted  to  certain  operating  conditions  and  is  futile 
for  others.  For  instance,  one  may  not  be  extremely 
interested  in  the  detectability  of  a  fault  in  the  air  intake 
system  under  conditions  where  the  engine  is  not  running 
(one  has  to  be  cautious  with  such  restrictions,  though, 
because  firstly,  there  may  be  a  requirement  to  perform 
fault  detection  beforehand,  such  as  checking  the 
operability  of  the  airbag  system  or  the  ABS,  and 
secondly,  a  broken  component  could  affect  operating 
modes  in  which  it  is  not  intended  to  be  active). 

In our approach, an operating condition has to be
expressed as a constraint on a subset of model variables.
Often, but not always, these variables will be exogenous,
such as the angle of the accelerator pedal or air
temperature, and typically, but not exclusively, they are
observables (the load, for instance, is not directly
observable).

In  most  cases,  the  constraint  that  defines  an  operating 
condition  will  be  a  conjunction  of  restrictions  on  variable 
values  to  some  interval  or  state  like  temperature>120°C 
or  ignition  =  ON. 

Restricting the analysis to certain operating conditions
then boils down to computing the intersection of a
behavior relation with the respective constraints.

Definition 1 (Discriminability of behavior modes)

Let $MODEL_{fault_1}$, $MODEL_{fault_2}$ be the behavior
relations of two modes,

$OPC_i$ an operating condition,
and

$SIT \subseteq DOM(V_{o\text{-}cause})$

a non-empty relation on the observable causal
variables.

For $OPC_i$ and $SIT$, two faults are called

not discriminable, written
$ND(fault_1, fault_2, OPC_i, SIT)$,
iff

(i) $SIT \subseteq PROJ_{o\text{-}cause}(OPC_i) \setminus PROJ_{o\text{-}cause}\big( (PROJ_{obs}(MODEL_{fault_1} \cap OPC_i) \setminus PROJ_{obs}(MODEL_{fault_2} \cap OPC_i)) \cup (PROJ_{obs}(MODEL_{fault_2} \cap OPC_i) \setminus PROJ_{obs}(MODEL_{fault_1} \cap OPC_i)) \big)$

deterministically discriminable, written
$DD(fault_1, fault_2, OPC_i, SIT)$,
iff

(ii) $SIT \subseteq PROJ_{o\text{-}cause}(OPC_i) \setminus PROJ_{o\text{-}cause}\big( PROJ_{obs}(MODEL_{fault_1} \cap OPC_i) \cap PROJ_{obs}(MODEL_{fault_2} \cap OPC_i) \big)$

possibly discriminable, written
$PD(fault_1, fault_2, OPC_i, SIT)$,
iff

$SIT \subseteq PROJ_{o\text{-}cause}(OPC_i) \setminus (SIT_{ND} \cup SIT_{DD})$,

where $SIT_{ND}$ and $SIT_{DD}$ are the maximal relations
that satisfy (i) and (ii), respectively.
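
As a minimal sketch of how Definition 1 reduces to set operations, the following Python code evaluates the three categories on small finite relations. It assumes a toy setting that is not taken from the text: the observables are pairs (i, o), i is the only observable causal variable, and the behavior relations are already projected onto the observables. The names `sit_nd`, `sit_dd`, `classify` and the mode `ANY_OUTPUT` are hypothetical.

```python
from itertools import product

# Assumed toy setting: observable tuples (i, o) over {0, 1}.
SPACE = set(product((0, 1), repeat=2))


def proj_cause(rel):
    """PROJ_o-cause: project a relation over (i, o) onto the causal variable i."""
    return {i for i, _ in rel}


def sit_nd(model1, model2, opc):
    """Maximal SIT satisfying (i): inputs where the relations do not differ."""
    a, b = model1 & opc, model2 & opc
    return proj_cause(opc) - proj_cause((a - b) | (b - a))


def sit_dd(model1, model2, opc):
    """Maximal SIT satisfying (ii): inputs where the relations do not overlap."""
    a, b = model1 & opc, model2 & opc
    return proj_cause(opc) - proj_cause(a & b)


def classify(model1, model2, opc, sit):
    """Return 'ND', 'DD' or 'PD' for a non-empty SIT, or None if SIT mixes categories."""
    nd, dd = sit_nd(model1, model2, opc), sit_dd(model1, model2, opc)
    if sit <= nd:
        return "ND"
    if sit <= dd:
        return "DD"
    if sit <= proj_cause(opc) - (nd | dd):
        return "PD"
    return None


OK = {(i, o) for i, o in SPACE if o == 1 - i}   # correct inverter
STUCK0 = {(i, o) for i, o in SPACE if o == 0}   # output stuck at 0
ANY_OUTPUT = set(SPACE)                         # hypothetical mode allowing any output

# Operating condition restricted to input i = 1 (illustrative only).
OPC_HIGH = {(i, o) for i, o in SPACE if i == 1}
```

With these toy relations, `classify(OK, STUCK0, SPACE, {0})` falls into the deterministic case, `classify(OK, STUCK0, SPACE, {1})` into the non-discriminable case, and `classify(OK, ANY_OUTPUT, SPACE, {0})` into the possibly discriminable case, because the unconstrained mode may or may not produce the output of the correct inverter.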

These definitions characterize the three cases discussed
above w.r.t. Figure 5 in a way that can be computed by
operations on the extensional constraint representation
generated by the model compiler.

Based  on  the  discriminability  of  modes,  discriminability 
of  fault  classes  can  be  defined  and  computed. 

Definition 2 (Discriminability of mode classes)

Let $FC_j = \{fault_{jk}\}$, $j = 1, 2$, be two fault classes and
$OPC_i$ an operating condition. Let furthermore

$SIT\text{-}SET = \{SIT_{kl}\} \subseteq P(DOM(V_{o\text{-}cause}))$

be a set of non-empty relations on the observable causal
variables. $FC_1$, $FC_2$ are called

not discriminable, written
$ND(FC_1, FC_2, OPC_i)$,
iff there exists a pair of modes that is completely
non-discriminable:

$\exists fault_{1k} \in FC_1 \; \exists fault_{2l} \in FC_2 : ND(fault_{1k}, fault_{2l}, OPC_i, PROJ_{o\text{-}cause}(OPC_i))$

deterministically discriminable, written
$DD(FC_1, FC_2, OPC_i, SIT\text{-}SET)$,
iff each pair of modes is deterministically
discriminable for some element of SIT-SET:

$\forall fault_{1k} \in FC_1 \; \forall fault_{2l} \in FC_2 \; \exists SIT_{kl} \in SIT\text{-}SET : DD(fault_{1k}, fault_{2l}, OPC_i, SIT_{kl})$

possibly discriminable, written
$PD(FC_1, FC_2, OPC_i, SIT\text{-}SET)$,
otherwise, iff all $SIT_{kl}$ are in the complement of the
non-discriminable situations:

$\forall k, l : SIT_{kl} \cap SIT_{ND,kl} = \emptyset$
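
The class-level definition can be sketched in the same way. The Python code below is an illustration under several assumptions that are not in the text: binary observables (i, o) with i as the only causal variable, "shorted" encoded as output equals input, an extra hypothetical mode "any-output", and a reading of the last condition as "for every pair of modes, some element of SIT-SET avoids that pair's non-discriminable situations".

```python
from itertools import product

SPACE = set(product((0, 1), repeat=2))  # observable tuples (i, o), assumed binary

MODES = {
    "ok":             {(i, o) for i, o in SPACE if o == 1 - i},
    "shorted":        {(i, o) for i, o in SPACE if o == i},   # assumed encoding
    "output-stuck-0": {(i, o) for i, o in SPACE if o == 0},
    "output-stuck-1": {(i, o) for i, o in SPACE if o == 1},
    "any-output":     set(SPACE),                             # hypothetical mode
}


def proj_cause(rel):
    """Project a relation over (i, o) onto the causal variable i."""
    return {i for i, _ in rel}


def sit_nd(m1, m2, opc):
    """Maximal situation set on which two modes are not discriminable."""
    a, b = m1 & opc, m2 & opc
    return proj_cause(opc) - proj_cause((a - b) | (b - a))


def sit_dd(m1, m2, opc):
    """Maximal situation set on which two modes are deterministically discriminable."""
    a, b = m1 & opc, m2 & opc
    return proj_cause(opc) - proj_cause(a & b)


def classify_classes(fc1, fc2, opc, sit_set):
    """Return 'ND', 'DD', 'PD' or None for two fault classes (lists of mode names)."""
    pairs = [(MODES[a], MODES[b]) for a in fc1 for b in fc2]
    whole = proj_cause(opc)
    # ND: some pair cannot be told apart anywhere in PROJ_o-cause(OPC).
    if any(whole <= sit_nd(m1, m2, opc) for m1, m2 in pairs):
        return "ND"
    # DD: every pair is deterministically discriminable on some element of SIT-SET.
    if all(any(s <= sit_dd(m1, m2, opc) for s in sit_set) for m1, m2 in pairs):
        return "DD"
    # PD (assumed reading): every pair has an element of SIT-SET disjoint from its SIT_ND.
    if all(any(not (s & sit_nd(m1, m2, opc)) for s in sit_set) for m1, m2 in pairs):
        return "PD"
    return None


C1 = ["output-stuck-0", "output-stuck-1"]
C2 = ["shorted", "ok"]
```

For the inverter classes, `classify_classes(C1, C2, SPACE, [{0}, {1}])` yields the deterministic case, whereas offering only the single situation `{0}` is not sufficient, because two of the mode pairs behave identically for that input.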

Status  and  Future  Work 

As of now, two different alternatives have been
implemented to generate the qualitative diagnosis models
from existing numerical models, both of which use Matlab
itself to compute the tuples of the modeling relation. In
addition, a library of qualitative models will be created
manually that allows the model to be configured based on the
structural description only. Based on a use case analysis,
the core of the diagnosability analysis tool and the
model-based on-board diagnosis engine have been developed.
IDD will use a number of guiding applications with the
goal of demonstrating how the diagnostic tasks described
can be performed by using the new process and the new
tools architecture. Furthermore, we aim to demonstrate
how additional advantages of the new method can be
achieved, e.g. optimization of sensor placement or deeper
diagnostic performance. Thereby, the guiding applications
serve, on the one hand, as case studies for the application
of the new techniques and, on the other hand, as test cases
and demonstrators of the results of the project.

The  guiding  applications  chosen  cover  on  the  one  hand 
different  mechatronic  systems  with  central  ECU- 
functions,  and  on  the  other  hand  the  general  application  of 
diagnostic  tasks  to  multiplexed  architecture  systems.  They 
include 

•  The air delivery system for diesel engines (Figure
7), comprising the exhaust gas turbocharging system
and the exhaust gas recirculation system (EGR), and
the Common Rail Injection System (Fiat and
Magneti-Marelli).


Figure  7  Guiding  application:  Air  delivery  system 

•  The  cooling  system  (DaimlerChrysler  AG), 
including  an  intercooler,  which  on  the  one  hand 
increases  the  efficiency  of  the  engine  by  cooling  the 
compressed  air  and,  hence,  increasing  the  air  charge 
rate,  and  on  the  other  hand  decreases  NOx  emissions 
by  keeping  the  combustion  at  lower  temperature 
(Figure  8). 

•  The  air  conditioning  system  (Peugeot  Citroen  PSA) 
which  consists  of  two  loops  that  supply  a  cold  heat 
exchanger  and  a  hot  heat  exchanger  (Figure  9). 

Figure 8 Guiding application: Cooling system


Figure  9  Guiding  application:  Air  conditioning  system 

•  The  multiplexed  architecture  (Renault)  involving 
ECUs,  sensors,  actuators,  functions  (EF  =  elementary 
functions),  busses  and  data  frames  (Figure  10).  The 
design  engineer  will  be  enabled  to  run  a  program 
directly  on  the  representation  of  a  designed 
architecture  and  receive  the  results  of  an  analysis  of 
the  interdependency  of  faults  and  functions  in  this 
architecture. 

Figure 10 Guiding application: Multiplexed architecture

A  first  version  of  models  for  these  guiding  applications 
has  been  developed  and  will  be  used  to  validate  and 
improve  the  model  abstraction  module  and  to  evaluate  the 
tools.  By  the  end  of  the  project  in  January  2003,  we  hope 
to  demonstrate  the  utility  of  the  tools  and  the  benefits  of 
the  modified  design  process  based  on  examples  that  are 
close  to  reality. 

Abstract

A  configuration  knowledge  base  is  software  that  needs 
debugging  during  maintenance  and  can  benefit  from 
consistency-based  diagnosis.  The  paper  describes 
suggestions  and  practical  experience  from  the 
introduction  of  this  diagnosis  technique  in  the  work 
flow  for  maintaining  configuration  knowledge  bases. 
Consistency-based  diagnosis  is  suitable  for  detecting 
bugs  in  knowledge  bases,  but  needs  tailoring  to  fit  in 
the  work  flow  of  the  knowledge  engineers. 

1  Introduction 

Configurators have already been applied to different
industry domains. For instance, telecommunication systems
are among the products successfully handled with
configurators. The crucial information is in the knowledge
bases of the configurators.

Configurators using declarative constraints [Mittal and
Frayman, 1989] are in everyday use and can generate and
modify configurations with more than 50,000 objects
[Fleischanderl et al., 1998]. Declarative constraints offer
easier maintenance compared to procedural specifications,
but also benefit from effective debugging methods.
Consistency-based diagnosis [Reiter, 1987] [Greiner et al.,
1988] is applicable to fault detection in configuration
knowledge bases [Felfernig et al., 2000], which is the topic
of this paper. The extensions towards hierarchical models
[Felfernig et al., 2001] are not discussed here because the
author has not applied them yet.

This paper discusses suggestions and practical experience
from applying diagnosis techniques to the debugging
of declarative knowledge bases for configurators. The
experience ranges from the planning of an engineering
process including diagnosis to the early adoption of
diagnosis for the debugging of knowledge bases. The
requirements of the development process for knowledge bases
are compared with the specification of the diagnosis method.


2  Maintaining  knowledge  bases 

Creating  and  maintaining  knowledge  bases  is  essentially  a 
software  engineering  process. 

After  collecting  and  analyzing  new  requirements,  the 
knowledge  base  is  modified  and  tested.  Regression  tests  are 
essential  for  long-term  maintenance.  So  the  results  from 
replaying  regression  tests  should  be  fed  into  a  diagnosis 
tool  if  the  new  output  differs  from  the  expected  output  of  a 
regression  test. 

In an ideal world, the discrepancies from regression tests
would be analyzed with a diagnosis tool, which would suggest
which constraints in the knowledge base are
responsible for the discrepancies. Unfortunately, this is not
that easy.

3  Preconditions  for  consistency-based 
diagnosis 

Consistency-based diagnosis needs a consistency checker,
i.e. a solver that yields conflict sets when a knowledge base
is in contradiction to a positive example. The configurator
kernel COCOS [Stumptner et al., 1998] applied by the
author is a solver that uses declarative constraints for
statically checking or expanding a partial configuration. The
kernel was extended to also yield conflict sets. So a
sufficiently powerful consistency checker is available.

The  elements  that  can  be  faulty  have  to  be  identifiable 
parts  of  a  knowledge  base.  In  our  case  the  constraints  can 
be  faulty  with  respect  to  positive  examples  and  are  the 
“components”  for  model-based  diagnosis. 

4  Requirements  and  consequences  of 
consistency-based  diagnosis 

4.1  Definition of a CKB-diagnosis

A CKB-diagnosis (i.e. diagnosis of configuration
knowledge bases) uses the model-based diagnosis paradigm and
is defined as follows [Felfernig et al., 2000].


Definition (CKB-Diagnosis Problem): A CKB-Diagnosis
Problem is a triple $(DD, E^{+}, E^{-})$ where DD is a
configuration knowledge base, $E^{+}$ is a set of positive and
$E^{-}$ a set of negative configuration examples. The examples
are given as sets of logical sentences. It is assumed that each
example on its own does not contain inconsistencies.

Definition: A CKB-diagnosis for a CKB-Diagnosis
Problem $(DD, E^{+}, E^{-})$ is a set $S \subseteq DD$ of sentences
such that there exists an extension EX, where EX is a set of
logical sentences, such that

$(DD - S) \cup EX \cup e^{+}$ is consistent $\forall e^{+} \in E^{+}$

$(DD - S) \cup EX \cup e^{-}$ is inconsistent $\forall e^{-} \in E^{-}$

Let  NE  be  the  conjunction  of  all  negated  negative 
examples.  This  is  the  most  easily  found  EX. 

Proposition: Given a CKB-Diagnosis Problem
$(DD, E^{+}, E^{-})$, a diagnosis S for $(DD, E^{+}, E^{-})$ exists iff

$\forall e^{+} \in E^{+} : e^{+} \cup NE$ is consistent.

Corollary: S is a diagnosis iff

$\forall e^{+} \in E^{+} : (DD - S) \cup e^{+} \cup NE$ is consistent.
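
To make the corollary concrete, the following Python sketch checks candidate diagnoses on a tiny propositional knowledge base by brute force. Everything in it is an assumption made for illustration: the three constraints, the variables a, b, c, the encoding of examples as partial assignments, and the exhaustive subset search, which stands in for the conflict-driven search a real diagnosis engine would use.

```python
from itertools import combinations, product

VARS = ("a", "b", "c")

# Hypothetical knowledge base DD: named constraints over a truth assignment.
KB = {
    "c1": lambda w: (not w["a"]) or w["b"],        # a implies b
    "c2": lambda w: (not w["b"]) or (not w["c"]),  # b implies not c
    "c3": lambda w: w["a"] or w["c"],              # a or c
}
POSITIVES = [{"a": True, "c": True}]    # E+: must be accepted
NEGATIVES = [{"a": False, "c": False}]  # E-: must be rejected


def consistent(constraints, example, negatives):
    """True if some assignment extends `example`, satisfies all constraints,
    and matches none of the negative examples (this last part plays the role of NE)."""
    for values in product((False, True), repeat=len(VARS)):
        world = dict(zip(VARS, values))
        if any(world[k] != v for k, v in example.items()):
            continue
        if any(all(world[k] == v for k, v in neg.items()) for neg in negatives):
            continue
        if all(check(world) for check in constraints.values()):
            return True
    return False


def is_diagnosis(kb, removed, positives, negatives):
    """Corollary: (DD - S) together with e+ and NE is consistent for every e+."""
    reduced = {name: c for name, c in kb.items() if name not in removed}
    return all(consistent(reduced, e, negatives) for e in positives)


def minimal_diagnoses(kb, positives, negatives):
    """Enumerate subset-minimal diagnoses by increasing size (exponential, toy only)."""
    found = []
    for size in range(len(kb) + 1):
        for candidate in combinations(sorted(kb), size):
            if any(set(f) <= set(candidate) for f in found):
                continue
            if is_diagnosis(kb, set(candidate), positives, negatives):
                found.append(candidate)
    return found
```

In this toy case the positive example requires a and c to hold together, which c1 and c2 jointly forbid, so removing either of them restores consistency, while removing c3 alone does not.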

4.2  Representation  of  examples 

The definition of a CKB-Diagnosis Problem says that the
examples are given as sets of logical sentences. This is
usually not the case in configurator implementations. Yet,
databases or other data representations can easily be
transformed into facts, i.e. logical sentences. This
transformation need not be done for the implementation of
diagnosis for configurator knowledge bases, but is a
precondition for the applicability of CKB-diagnosis.

With logical sentences one can define a configuration as
a set of fragments. In configurator applications,
configurations are based on an object model, which is usually
defined with UML. All objects usually are reachable from
one entry object. So the positive or negative examples
cannot just be isolated sub-configurations, but must be
connected objects. This is a slight restriction that does not
limit the diagnosis.

This  property  of  configurations  ensures  that  trivial 
inconsistencies  are  avoided,  e.g.  there  cannot  be  two 
modules  in  the  same  slot.  Therefore  each  example  (i.e.  its 
structure  of  objects  and  connections)  does  not  contain 
inconsistencies  among  its  elements. 

4.3  Conjunction  of  negated  negative  examples 

The  definition  of  a  CKB-diagnosis  requires  an  extension 
EX.  The  question  is:  Where  does  EX  come  from? 

The  simplest  EX  would  be  the  negation  of  all  negative 
examples,  i.e.  NE  as  defined  above.  This  is  not  a  useful 
solution  for  maintaining  configurator  knowledge  bases  in 
real  life.  This  would  reduce  the  advantages  of  declarative 
constraints,  namely  that  knowledge  bases  contain  little 
redundant  information  and  can  be  understood  easily  by 
domain  experts.  Furthermore,  the  constraints  should  be 
sufficiently  general  to  be  applicable  to  similar  situations  in 
the future. The negation of configurations (i.e. negative
examples) would clutter the knowledge base with facts that
may overlap and would not prevent examples that are
slightly different.

4.4  Diagnosis  is  part  of  the  existing  knowledge 
base 

According to the definition of CKB-diagnosis, a diagnosis
is a subset of the knowledge base. That means faults are
found among the constraints in the existing knowledge
base. This is useful in real-life projects and makes
consistency-based diagnosis worthwhile. Yet, defining new
constraints (thus extending the knowledge base) has to be
accomplished with other approaches.

5  Integrating  consistency-based  diagnosis  in 
the  software  engineering  process 

The definitions for consistency-based diagnosis of
configuration knowledge bases do not tell a lot about how to
proceed (step by step) to reach a correct knowledge base.
However, the conditions for the correctness check for
knowledge bases are specified.

This section describes how to use diagnosis in the
software engineering process for knowledge bases.

5.1  Use  the  examples  one  by  one 

Examples, i.e. stored configurations, may be partial or
complete. Due to restrictions coming from the usual object
models in software development, each example is a
network of objects that can be reached from an entry object.
Therefore, only one example can be loaded at one time.
This holds for positive and negative examples.

5.2  Negative  examples  are  outsiders 

In  the  diagnosis  process  discussed  here,  negative  examples 
do  not  yield  hints  for  mistakes  in  a  knowledge  base. 

We expect that negative examples lead to
inconsistencies. If a negative example is consistent with the
knowledge base, the consistency-based diagnosis has no
discrepancy to start from. The practical suggestion then is
to analyze the consistent negative examples "by hand" and
modify the knowledge base to rule out those examples.
This corresponds to finding the mysterious EX in the
definition of CKB-diagnosis.

The good news, however, is that negative examples
usually are modifications of positive examples or previously
positive examples that became negative after a
modification to the knowledge base. Our experience from
maintenance over many years shows that these negative examples
will mostly remain negative examples after more
modifications to the knowledge base.

Help also comes from good practice in software
engineering. When knowledge bases are stored in a version
control (configuration management) system, we can find
the latest previous version where some negative example
was still rejected by the knowledge base. Comparing that
older version with the current knowledge base shows the
constraints that were modified or removed in the meantime.
This is of course an excellent starting point for modifying
the current knowledge base such that it again rejects the
negative example.
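
The version comparison described above can be sketched as follows. This is a hypothetical illustration in Python, not a description of the author's tooling: a knowledge base version is modelled as a dictionary of named constraints, each constraint as an (attribute, upper limit) pair, and `rejects` is a stand-in for a call to the real consistency checker.

```python
def rejects(version, example):
    """Toy consistency check: a version rejects an example if any limit is exceeded."""
    return any(example.get(attr, 0) > limit for attr, limit in version.values())


def latest_rejecting_version(history, example):
    """Index of the most recent version that still rejects `example`, or None."""
    for index in range(len(history) - 1, -1, -1):
        if rejects(history[index], example):
            return index
    return None


def changed_constraints(old, new):
    """Names of constraints removed or modified between two versions."""
    removed = sorted(name for name in old if name not in new)
    modified = sorted(name for name in old if name in new and old[name] != new[name])
    return removed, modified


# Hypothetical history, oldest first; the last entry is the current knowledge base.
HISTORY = [
    {"slot_limit": ("modules_in_slot", 1), "rack_limit": ("racks", 4)},
    {"slot_limit": ("modules_in_slot", 1), "rack_limit": ("racks", 8)},
    {"rack_limit": ("racks", 8)},
]
NEGATIVE_EXAMPLE = {"modules_in_slot": 2, "racks": 2}

last_good = latest_rejecting_version(HISTORY[:-1], NEGATIVE_EXAMPLE)
removed, modified = changed_constraints(HISTORY[last_good], HISTORY[-1])
```

In this made-up history the negative example was last rejected by the second version, and the comparison points at the removed slot constraint as the place to start repairing the current knowledge base.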

When all negative examples are rejected by the
knowledge base, start looking at the positive examples. So the
negative examples are treated outside the diagnosis step.

5.3  Use  the  results  from  regression  tests 

Like  any  software,  knowledge  bases  can  be  maintained 
more  efficiently  by  using  regression  tests  and  checking 
them  after  a  modification. 

When  a  regression  test  produces  an  output  different  from 
its  reference,  find  out  whether  the  new  output  is  expected 
(after  a  modification  to  the  knowledge  base).  Only  if  the 
new  output  is  different  from  what  is  expected,  feed  this 
output  into  diagnosis. 

5.4  Do  diagnosis  and  repeat  the  cycle 

Finally, we use consistency-based diagnosis to detect faults
in the knowledge base. This follows the definition of
CKB-diagnosis as described above. The well-defined
preconditions and semantics of the method make it particularly
valuable.

After the knowledge base has been modified, we must repeat
the cycle of testing and diagnosis until all negative
examples are inconsistent and all positive ones are consistent.

The  cycle  described  here  starts  with  the  negative 
examples  (by  modifying  or  extending  the  knowledge  base) 
and  continues  with  the  positive  examples  (by  modifying  or 
reducing  the  knowledge  base).  This  could  be  done  the  other 
way  round.  The  "optimal"  sequence,  however,  depends  on 
the  structure  of  the  knowledge  base  and  the  expert's 
experience  and  point  of  view.  The  objective  is  to  modify 
the  knowledge  base  such  that  it  remains  easy  to  maintain 
and  easy  to  understand.  We  are  confident  that  the  steps 
described  above  help  us  get  close  to  this  objective. 

6  Beyond  diagnosis 

Beyond  the  scope  of  CKB-diagnosis,  other  methods  can  be 
useful  for  maintaining  knowledge  bases. 

Automatic  generation  of  test  cases  would  be  helpful  for 
producing  a  large  set  of  regression  test  cases.  This  would 
assure  the  quality  of  knowledge  bases  that  are  maintained 
over  several  years. 

If a negative example is consistent, automatic
generalization of the negated negative example could yield a
non-redundant modification to the knowledge base. Here the
optimum between introducing too many new constraints
and over-generalization has to be found. For this purpose
the methods for automatic learning of concepts have to be
analyzed with respect to the semantics of the configuration
knowledge base.

7  Summary  and  conclusion 

Consistency-based diagnosis is applicable to the debugging
of configuration knowledge bases. The method is
particularly valuable because of its well-defined preconditions
and semantics.


Integrating  CKB-diagnosis  in  the  software  engineering 
process  for  knowledge  bases  can  be  done  efficiently  and 
effectively.  There  are  minor  limitations  where  CKB- 
diagnosis  cannot  be  fully  applied,  i.e.  with  respect  to 
automatic  suggestions  from  negative  examples.  Altogether 
the  experience  from  the  planning  of  a  debugging  process 
with  diagnosis  and  from  the  early  adoption  is  encouraging. 
Results  from  wide  usage  will  follow. 

Acknowledgement 

I want to thank Gerhard Friedrich and Dietmar Jannach for
their valuable contributions to our discussions.

References 

[Felfernig et al., 2000] Alexander Felfernig, Gerhard E.
Friedrich, Dietmar Jannach, and Markus Stumptner.
Consistency-based Diagnosis of Configuration Knowledge
Bases. Proceedings of the 14th European Conference on
Artificial Intelligence (ECAI-2000), pp. 146-150, Berlin,
Aug. 2000, IOS Press.

[Felfernig et al., 2001] Alexander Felfernig, Gerhard E.
Friedrich, Dietmar Jannach, and Markus Stumptner.
Hierarchical diagnosis of large configurator knowledge
bases. Working Notes of the 12th Intl. Workshop on
Principles of Diagnosis (DX-2001), Via Lattea, Italy,
March 2001.

[Fleischanderl et al., 1998] Gerhard Fleischanderl, Gerhard
E. Friedrich, Alois Haselböck, Herwig Schreiner, and
Markus Stumptner. Configuring large systems using
generative constraint satisfaction. IEEE Intelligent Systems
& their applications, 13(4):59-68, July/Aug. 1998.

[Greiner et al., 1988] Russell Greiner, Barbara A. Smith,
and Ralph W. Wilkerson. A Correction to the Algorithm in
Reiter's Theory of Diagnosis. Artificial Intelligence,
41(1):79-88, Nov. 1989.

[Mittal and Frayman, 1989] Sanjay Mittal and Felix
Frayman. Towards a generic model of configuration tasks.
Proceedings of the 11th Intl. Joint Conference on Artificial
Intelligence (IJCAI-1989), pp. 1395-1401, Detroit, Aug.
1989, Morgan Kaufmann Publishers.

[Reiter, 1987] Raymond Reiter. A Theory of Diagnosis
from First Principles. Artificial Intelligence, 32(1):57-95,
Apr. 1987.

[Stumptner et al., 1998] Markus Stumptner, Gerhard E.
Friedrich, and Alois Haselböck. Generative Constraint-
Based Configuration of Large Technical Systems.
AI-EDAM (Artificial Intelligence for Engineering, Design,
Analysis and Manufacturing), 12(4):307-320, Special Issue
on Configuration, Sep. 1998.


Consistency-Based  Fault  Isolation  for  Uncertain  Systems 
with  Applications  to  Quantitative  Dynamic  Models 

Colin N. Jones 1 and Gregory W. Bond 2 and Peter D. Lawrence 3


Abstract. This paper presents the Probabilistic General Diagnostic
Engine (PGDE), a novel method of offline consistency-based fault
isolation. Many existing proposals require qualitative logic
models for consistency-based diagnosis due to their ability to speed the
search for conflict sets through the use of an ATMS. However, for
many applications, quantitative dynamic models are preferred or
already available. The key strength of the PGDE is that it allows the use
of any modelling language for which an appropriate calculation
engine can be written. It also offers graceful degradation in the presence
of uncertainty, commonly caused by noise or modelling errors.
Finally, given perfect knowledge, it can be shown that the PGDE
computes the same result as existing consistency-based diagnosis
methods. To demonstrate the performance of the algorithm, we have used
a quantitative dynamic model of the fluid power circuit of a
single-degree-of-freedom hydraulic test bench and developed an
appropriate calculation engine for computing consistency between measured
values and predicted results. Various failures were generated on the
physical test bench and the PGDE isolated the faults with
approximately 85% accuracy.

1  INTRODUCTION 

Consistency-based diagnosis has at its heart the search for a subset of
the full model such that predictions made using the subset are
consistent with sensor measurements. This search space is exponential
in the number of model components and so a great deal of attention
has been given to developing efficient algorithms. Much progress has
been made by utilizing the properties of propositional logic and
qualitative models ([10, 8, 1] to name a few) but the problems associated
with more complex dynamic systems have still to be solved in
general. The Probabilistic General Diagnostic Engine (PGDE) addresses
some of these issues in a general framework that applies to any model
for which an appropriate “consistency measure” can be formulated.

There are many devices for which quantitative dynamic models
either already exist or whose behavior can best be described by a set
of differential equations. The cost of developing qualitative models
exclusively for the purpose of diagnosis is prohibitive, thus making
the adaptation of qualitative methods to quantitative dynamic models
an important topic. Models of this type present two new challenges
to the diagnostician: First, quantitative dynamic models require the
comparison of sets of signals to determine consistency. Due to noise
and modelling errors, it can be difficult to represent the results of
these comparisons by the discrete values typically used in qualitative
methods. Second, the nature of dynamic systems is that they often
have states which are not directly measurable. When the model is
simulated using only the equations from a few components, it is often
the case that many of the states will become unknown. If no conflict
is observed, we reason that a possible diagnosis has been identified;
however, it is impossible to know if there would have been a conflict
if these states had been known. As a result, the underconstrained
nature of dynamic systems reduces the resolution of fault isolation
procedures and this must be taken into account in any diagnostic method
dealing with these models.

1  Cambridge University, Engineering Department, Trumpington St.,
Cambridge, CB2 1PZ, U.K.

2  AT&T Labs - Research, 180 Park Avenue, Rm. D273, Bldg. 103, Florham
Park, NJ, P.O. Box 971, U.S.A.

3  University of British Columbia, EECE Department, 2356 Main Mall,
Vancouver, BC, V6T 1Z4, Canada

The PGDE algorithm attempts to deal with these difficulties by
maintaining a belief distribution for each possible diagnosis. Since
these distributions are not limited to discrete-valued consistency
measures, the PGDE is able to more accurately interpret
intermediate non-boolean consistency assessments. They are also updated
throughout the duration of the diagnostic procedure, and conclusions
about the consistency of sets of components with observations are
not drawn until sufficient information has been processed. In
Section 2, the proposed algorithm is laid out in a step-by-step fashion,
including consideration of its computational complexity in Section
2.5. Next, Section 3 presents a non-trivial example hydraulic circuit
and summarizes some diagnostic results obtained by the PGDE.
Finally, the paper closes with a discussion of conclusions and future
directions of research in Section 4.

2  PGDE  ALGORITHM 

The model used in a consistency-based algorithm is a set of
constraints on the signals passing through the system. A failure can be
declared when these signals are inconsistent with the constraints. The
goal of the algorithm is then to locate a subset of these constraints
which, when removed from the model, restores consistency between
the predicted and observed behavior. This process can proceed in an
iterative manner, selecting a set of constraints to remove and
simulating the system until a feasible set is found.

We  begin  by  defining  the  system  as  in  [7]: 

Definition  1  A  system  is  a  triple  (SD, COMPS, OBS)  where: 

1.  the  components  (COMPS)  are  a  finite  set  of  constants 

2.  the  system  description  (SD)  is  a  set  of  constraints 

3.  the observations (OBS) are measurements of the physical device

There is no requirement that there be a one-to-one mapping
from components to constraints and so a partition $\{SD_c\}_{c \in COMPS}$
is defined covering SD such that $\bigcup_{c \in COMPS} SD_c = SD$ and
$SD_{c_i} \cap SD_{c_j} = \emptyset \;\; \forall c_i \neq c_j$. The set of all possible failures is
given by the power set of COMPS and for each element $A \in P(COMPS)$,
define $SD_A = \bigcup_{c \in A} SD_c$. This allows the
definition of components which contain large numbers of constraints or
complex behaviors as well as hierarchies of components. The
cardinality of a set of constraints $X \subseteq SD$ is written as $|X|$; it is a
system-dependent real number, representing the notion of how “large” the set
X is when compared to SD.
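
The partition and the derived sets can be illustrated with a few lines of Python. The component and constraint names below are hypothetical, and taking $|X|$ as the fraction of SD covered by X is only one possible choice, since the text leaves the measure system-dependent.

```python
from itertools import chain, combinations

# Hypothetical partition of a system description: component -> its constraints.
SD_PARTS = {
    "pump":  {"pump_flow", "pump_pressure"},
    "valve": {"valve_flow"},
    "ram":   {"ram_force", "ram_motion"},
}
COMPS = frozenset(SD_PARTS)
SD = set().union(*SD_PARTS.values())


def is_partition(parts, sd):
    """Blocks must be pairwise disjoint and together cover SD."""
    blocks = list(parts.values())
    covered = set().union(*blocks)
    disjoint = sum(len(b) for b in blocks) == len(covered)
    return disjoint and covered == sd


def power_set(items):
    """All subsets of `items`, i.e. all candidate sets of failed components."""
    items = sorted(items)
    subsets = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(s) for s in subsets]


def sd_of(a):
    """SD_A: union of the constraints of all components in A."""
    return set().union(*(SD_PARTS[c] for c in a))


def complement(a):
    """(SD_A)^C: the constraints that remain when the components in A are suspended."""
    return SD - sd_of(a)


def relative_size(x):
    """Assumed choice for |X|: the fraction of SD covered by X."""
    return len(x) / len(SD)
```

For example, `complement({"pump"})` leaves the valve and ram constraints, which is the partial model that would be simulated when the pump is the candidate diagnosis.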

Reiter’s original work [7] relies on a ‘theorem prover’,
$TP(SD, \mathcal{D}(A, COMPS \setminus A), OBS)$, which returns true if the
partial model containing only the constraints in the complement of
$SD_A$, $(SD_A)^C$, is consistent with the observations OBS and false
otherwise; consistency implies that the components A are a
possible diagnosis. Here the theorem prover is redefined to return a
continuous measure of how consistent the constraints $(SD_A)^C$ are with the
observations OBS. It is possible that the system defined by $(SD_A)^C$
with OBS as inputs may be underconstrained. Thus, for some of the
constraints in $(SD_A)^C$, it is impossible to verify if they have, or have
not, been violated. If this system is consistent then it is not valid to
say that A is a diagnosis as the faults might have been in the
constraints that could not be tested. This situation is very common in
dynamic systems with state as they are inherently underconstrained
[4]. To deal with this, the constraints which were used during the
simulation of $(SD_A)^C$ are returned by $TP(\cdot)$ as defined below.

Definition 2 Let $A \in P(COMPS)$. Define the function $TP(\cdot, \cdot) :
SD \times OBS \to \mathbb{R} \times SD$ as:

$(\rho_A, \Lambda_A) = TP((SD_A)^C, OBS)$

where:

•  $\rho_A \in [0, 1]$; 1 implies that the constraints $(SD_A)^C$ are consistent
with the observations OBS, and 0 implies inconsistency

•  $\Lambda_A \subseteq (SD_A)^C$ are the constraints which $TP(\cdot)$ had sufficient
information to apply during the calculation of $\rho_A$

Two belief distributions over the states {true, false, unknown} are maintained for each element $A \in P(COMPS)$. These are represented by the probability mass functions $B_{D,A}(x)$ and $B_{IC,A}(x)$ with domains {true, false, unknown}. $B_{D,A}(true)$ is the belief that the evidence, provided by calls to TP(·), shows that A is a diagnosis. $B_{D,A}(false)$ is the belief that the evidence does not show that A is a diagnosis. It does not mean that the evidence shows that A is not a diagnosis, as consistency can only incriminate components; it cannot exonerate them [7]. Finally, $B_{D,A}(unknown)$ is the probability that it is unknown what the evidence shows, or that there is no evidence. If $\mu_A = 0$ then at least one component of $A^C$ must be faulty, and we call $A^C$ a conflict set [7] and A an inverse conflict. $B_{IC,A}(true)$ is the belief that the evidence shows that A is an inverse conflict, $B_{IC,A}(false)$ that it does not, and $B_{IC,A}(unknown)$ that the evidence is unclear.

Initially, all the beliefs are 100% unknown ($B_{D,A}(x) = B_{IC,A}(x) = \{0.0, 0.0, 1.0\}$). In each iteration, a call is made to TP(·) to check whether a new set of constraints, $(SD_A)^C$, is consistent with the observations, OBS. The distributions are then updated to reflect the simulator's certainty in the consistency of each set of components with the observations. In this way, the diagnostic engine determines the components that are most likely to be faulty, as well as a measure of its confidence in these decisions.

A block diagram of the PGDE is shown in Figure 1. The following sections deal with each stage of the algorithm in detail, in the order: updating the beliefs (steps 3 and 4), choosing a new set to test for consistency via TP(·) (step 1), and deciding when to stop and interpreting the final belief distributions (steps 5 and 6).

2.1  Belief  update 

Once a possible diagnosis, A, has been selected, TP(·) is used to find the consistency measure, $\mu_A$, and the constraints which were used to compute it, $\Lambda_A$. The goal is to determine what the consistency measure has shown about each of the subsets of COMPS, using $\Lambda_A$ as a guide. Assuming no fault models, two properties of constraint systems allow the consistency measure of the set A to affect the beliefs of other sets: supersets of diagnoses are diagnoses (removing more constraints will not make the system inconsistent) and subsets of inverse conflicts are inverse conflicts (adding constraints will not make the system consistent). Using these facts, the supersets of A are first considered and the information derived from $\mu_A$ and $\Lambda_A$ is used to update the beliefs that they are diagnoses ($B_{D,A_P}(x)\ \forall A_P \supseteq A$). Similarly, the beliefs that the subsets are inverse conflicts are also updated ($B_{IC,A_C}(x)\ \forall A_C \subseteq A$).

2.1.1  Update  belief  in  diagnosis 

We begin by assuming that $\mu_A = 1$, indicating that the observations are consistent with the constraints $(SD_A)^C$. The goal is to determine to what degree this evidence shows that each set is a diagnosis. The first step is to locate the base set, $A_B$, for the set $(SD_A)^C$ as defined below in Definition 3. This is the set with the most components of which none have had any of their constraints used during the calculation of $\mu_A$. Referring to Figure 2, in which $TP((SD_{\{1,2,3\}})^C, OBS)$ was called, the base node is $A_B = \{1, 2, 3, 4\}$. If $A \neq A_B$, then the constraints of at least one component have not been considered due to the assumption that the components in A were faulty (in Figure 2 this would be component 3). In essence, TP(·) cannot distinguish between any set $A'$ such that $A \subseteq A' \subseteq A_B$, since whenever the constraints associated with the components in A are not considered, neither are those of $A_B$, which implies that $\mu_A = \mu_{A'} = \mu_{A_B}$. This is a limitation of the model and the placement of the sensors; as a result, the best the algorithm can do is incriminate $A_B$ and inform the user of this sensor deficiency. Because the consistency measure would be the same for all of the sets $A'$ such that $A \subseteq A' \subseteq A_B$, the sets are marked and ignored in subsequent calls to TP(·). For certain model types these families of sets can be identified a priori and grouped into single components to speed the algorithm [1, 2].

Definition 3 Let $A \subseteq A_B \subseteq COMPS$. Then $A_B$ is the base set for A iff

$SD_{A_B} \cap \Lambda_A = \emptyset$
$\forall A' \supset A_B,\ SD_{A'} \cap \Lambda_A \neq \emptyset$
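
Definition 3 can be read operationally: the base set collects every component none of whose constraints were used by TP(·). The following is a minimal Python sketch of that reading, not the authors' implementation. It assumes, hypothetically, that SD is stored as a dictionary from component identifiers to sets of constraint names and that the used constraints are given as a set of names; the function name `base_set` and the toy data are illustrative only.

```python
from typing import Dict, FrozenSet, Hashable, Set


def base_set(sd: Dict[Hashable, Set[str]], used: Set[str]) -> FrozenSet[Hashable]:
    """Return the largest set of components with no constraint in `used`.

    `sd` maps each component to its constraints (assumed representation);
    `used` plays the role of the constraints returned by TP.
    """
    return frozenset(c for c, constraints in sd.items() if not (constraints & used))


# Hypothetical four-component system; component 3 is only partly used.
sd = {1: {"k1"}, 2: {"k2"}, 3: {"k3a", "k3b"}, 4: {"k4a", "k4b"}}
used = {"k3a", "k4a", "k4b"}
print(sorted(base_set(sd, used)))
```

Because the used constraints are drawn from the complement of the tested set's constraints, the tested set is always contained in the result, so no separate union with it is needed under this representation.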

If the constraints associated with $A_B$ are not considered during the call to TP(·), those in $(\Lambda_A)^C \setminus SD_{A_B}$ are not either (in Figure 2 this would be the unshaded sections of components 5 and 6). These are the constraints which were not considered that do not make up a full component. The question is: is the lack of conflict during the computation of $\mu_A$ due to the constraints in $SD_{A_B}$, those in $(\Lambda_A)^C \setminus SD_{A_B}$, or some combination of the two? The safest approach would be to say that this evidence can only increase the belief that some set $A' \supseteq A_B$ which covers all of $(\Lambda_A)^C$ is a diagnosis ($A' = \{1, 2, 3, 4, 5, 6\}$ in the example). However, if $|(\Lambda_A)^C \setminus SD_{A_B}| \ll |SD_{A_B}|$, this would be a very conservative approach, in the sense that a set will never be called a diagnosis if it cannot completely explain the observed behavior, and multiple component failures would be returned more often than they should. In most cases, designing models which reduce the size of $(\Lambda_A)^C \setminus SD_{A_B}$ will increase the precision of the diagnosis, and so we assume that most modelers will aim for this characteristic and, as a result, that $|(\Lambda_A)^C \setminus SD_{A_B}|$ is small compared to $|SD_{A_B}|$.

Figure 1. The PGDE Algorithm

Figure 2. Example nine component system
Under the assumption that the majority of the constraints which were not considered during the computation of $\mu_A$ belong to $A_B$, this evidence increases the belief that $A_B$ is a diagnosis. However, because every superset of a diagnosis is a diagnosis, this evidence also increases the belief that all of the supersets of $A_B$ are diagnoses. Therefore, for each set $A_P \supseteq A_B$, the probability that the constraints in $SD_{A_P}$ can account for the lack of conflict during the computation of $\mu_A$ is:

$P(A_P \text{ is a diagnosis} \mid \Lambda_A \wedge \mu_A = 1) = \frac{|(\Lambda_A)^C \cap SD_{A_P}|}{|(\Lambda_A)^C|}$   (4)

Assuming that faults are equally likely to be anywhere in $(\Lambda_A)^C$, the probability that they are in $SD_{A_P}$ is given by Equation 4, as the proportion of $(\Lambda_A)^C$ that is covered by $SD_{A_P}$. If all of, or more than, $(\Lambda_A)^C$ is covered, then the probability that the system will be consistent is 100%, by the assumption that $\mu_A = 1.0$.
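
The coverage ratio of Equation 4 can be illustrated with a short Python sketch. It rests on two assumptions that are not in the text: the cardinality of a constraint set is taken to be a plain count (the paper allows any system-dependent real number), and SD is again a hypothetical dictionary from components to constraint names. The function name `diagnosis_probability` is illustrative.

```python
from typing import Dict, Hashable, Iterable, Set


def diagnosis_probability(
    sd: Dict[Hashable, Set[str]], used: Set[str], candidate: Iterable[Hashable]
) -> float:
    """Fraction of the unused constraints covered by the candidate's constraints."""
    all_constraints = set().union(*sd.values())
    unused = all_constraints - used  # complement of the used constraints
    if not unused:
        raise ValueError("every constraint was used; the ratio is undefined")
    covered = set().union(*(sd[c] for c in candidate))  # constraints of the candidate
    return len(unused & covered) / len(unused)


# Hypothetical system: constraints k1, k2 and k3b were not used by TP.
sd = {1: {"k1"}, 2: {"k2"}, 3: {"k3a", "k3b"}, 4: {"k4a", "k4b"}}
used = {"k3a", "k4a", "k4b"}
print(diagnosis_probability(sd, used, {1, 2}))
print(diagnosis_probability(sd, used, {1, 2, 3}))
```

In this toy case the base set {1, 2} covers two of the three unused constraints, while the superset {1, 2, 3} covers all of them and so reaches the 100% case described above.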

This probability is computed assuming $\mu_A = 1$, when in fact it may well be less than one. The consistency measure describes our ability to measure how consistent the observations are with the constraints $\Lambda_A$. The real components A are either consistent or inconsistent with observations, and it is only the inability of the model and sensors to perfectly determine which one is true that causes $\mu_A < 1$. Therefore the consistency measure can be interpreted as a probability that the real artifact is consistent or inconsistent, and we assume a mapping $\mathcal{PC}(\mu_A)$ to [0, 1], defined by the modeler, which represents how probable it is that the real artifact is consistent given $\mu_A$.

For each $A_P \supseteq A_B$ we define a belief distribution $B_{D,A_P}(x; A)$ over the states {true, false, unknown} which represents the belief that $A_P$ is a diagnosis given only the information from calling TP(·) on A. The distribution is defined as follows:

$B_{D,A_P}(true; A) = P(A_P \text{ is a diagnosis} \mid \Lambda_A \wedge \mu_A = 1) \cdot \mathcal{PC}(\mu_A)$
$B_{D,A_P}(false; A) = (1 - P(A_P \text{ is a diagnosis} \mid \Lambda_A \wedge \mu_A = 1)) \cdot \mathcal{PC}(\mu_A)$
$B_{D,A_P}(unknown; A) = 1 - \mathcal{PC}(\mu_A)$   (5)

Equation 5 takes the probability that a set is a diagnosis given $\Lambda_A$ and that the measure is consistent, and then scales this probability by the certainty that the call to TP(·) returned consistent. This distribution is now combined with the current beliefs using Bayes' Theorem and the Total Probability Theorem.
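
As a small worked instance of Equation 5, the sketch below builds the evidence distribution from two numbers: the coverage probability of Equation 4 and the modeler's mapping evaluated at the measured consistency. Both inputs are passed in directly, since the mapping itself is left to the modeler; the function name `diagnosis_evidence` and the numeric values are illustrative assumptions.

```python
from typing import Dict


def diagnosis_evidence(p_diag: float, pc: float) -> Dict[str, float]:
    """Evidence distribution over {true, false, unknown} following Equation 5.

    p_diag: probability that the set is a diagnosis, assuming full consistency.
    pc:     probability that the real artifact is consistent (modeler's mapping).
    """
    if not (0.0 <= p_diag <= 1.0 and 0.0 <= pc <= 1.0):
        raise ValueError("both inputs must lie in [0, 1]")
    return {
        "true": p_diag * pc,
        "false": (1.0 - p_diag) * pc,
        "unknown": 1.0 - pc,
    }


evidence = diagnosis_evidence(p_diag=0.75, pc=0.8)
print(evidence)
```

The three entries always sum to one, and the unknown mass depends only on how much the call to TP(·) is trusted, not on the coverage ratio.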


Let F be the set {true, false, unknown}. Then the current belief distribution, $B_{D,A_P}(x)$, is updated by the evidence $B_{D,A_P}(x; A)$ to the new belief distribution $B^+_{D,A_P}(x)$:

$B^+_{D,A_P}(x) = \sum_{f_1, f_2 \in F} P(B^+_{D,A_P}(x) \mid B_{D,A_P}(f_1) = 1 \wedge B_{D,A_P}(f_2; A) = 1) \cdot B_{D,A_P}(f_1) \cdot B_{D,A_P}(f_2; A)$   (6)

The probabilities $P(B^+_{D,A_P}(x) \mid B_{D,A_P}(f_1) = 1 \wedge B_{D,A_P}(f_2; A) = 1)$ in Equation 6 can be represented by a conditional probability table as shown in Table 1. The first two columns represent $f_1$ and $f_2$ respectively and the last three represent x. The values in Table 1 are chosen such that if the current belief is very certain, as defined by the weight of the unknown state, then a new distribution which is very uncertain will not strongly influence the belief, and vice versa. If the new evidence agrees with our current belief, then this belief is strengthened, and if it does not then it is weakened.

Table 1. Conditional Probability Table used to update $B_{D,A_P}(x)$ given $B_{D,A_P}(x; A)$

$P(B^+_{D,A_P}(x) \mid B_{D,A_P}(f_1) = 1 \wedge B_{D,A_P}(f_2; A) = 1)$

f1        f2        x = true   x = false   x = unknown
…         …         …          …           …
…         …         0.5        0.5         0.0
False     False     0.0        1.0         0.0
False     Unknown   0.0        1.0         0.0
Unknown   True      1.0        0.0         0.0
Unknown   False     0.0        1.0         0.0
Unknown   Unknown   0.0        0.0         1.0
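
The update of Equation 6 is a sum over all pairs of current and evidence states, weighted by a conditional probability table. The Python sketch below shows that computation with the table passed in as a parameter. The rows for (false, false), (false, unknown) and the three rows with an unknown current state follow the values given in Table 1; the rows for (true, true), (true, false), (true, unknown) and (false, true) are assumptions filled in by symmetry purely for illustration and should be replaced by the modeler's actual table. The names `update_belief` and `ILLUSTRATIVE_CPT` are hypothetical.

```python
from itertools import product
from typing import Dict, Tuple

STATES = ("true", "false", "unknown")
Dist = Dict[str, float]
CPT = Dict[Tuple[str, str], Dist]


def row(t: float, f: float, u: float) -> Dist:
    return {"true": t, "false": f, "unknown": u}


ILLUSTRATIVE_CPT: CPT = {
    ("true", "true"): row(1.0, 0.0, 0.0),         # assumed
    ("true", "false"): row(0.5, 0.5, 0.0),        # assumed
    ("true", "unknown"): row(1.0, 0.0, 0.0),      # assumed
    ("false", "true"): row(0.5, 0.5, 0.0),        # assumed
    ("false", "false"): row(0.0, 1.0, 0.0),       # from Table 1
    ("false", "unknown"): row(0.0, 1.0, 0.0),     # from Table 1
    ("unknown", "true"): row(1.0, 0.0, 0.0),      # from Table 1
    ("unknown", "false"): row(0.0, 1.0, 0.0),     # from Table 1
    ("unknown", "unknown"): row(0.0, 0.0, 1.0),   # from Table 1
}


def update_belief(current: Dist, evidence: Dist, cpt: CPT) -> Dist:
    """Combine the current belief with new evidence as in Equation 6."""
    new = {x: 0.0 for x in STATES}
    for f1, f2 in product(STATES, repeat=2):
        weight = current[f1] * evidence[f2]
        for x in STATES:
            new[x] += cpt[(f1, f2)][x] * weight
    return new


initial = row(0.0, 0.0, 1.0)      # all beliefs start 100% unknown
evidence = row(0.6, 0.2, 0.2)
once = update_belief(initial, evidence, ILLUSTRATIVE_CPT)
twice = update_belief(once, evidence, ILLUSTRATIVE_CPT)
print(once)
print(twice)
```

Starting from a fully unknown belief, the first update simply adopts the evidence, which depends only on rows taken from Table 1. The second update depends on the assumed rows as well, so its numbers are illustrative rather than a reproduction of the paper's behavior.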

2.1.2  Update  belief  in  inverse  conflict 

To update the beliefs $B_{IC,A}(x)$, much the same procedure is followed as in the case where the system is consistent, only now the evidence suggests that the considered sets are inverse conflicts rather than diagnoses. As before, the first step is to locate the set $A_B$, but now it is the base set of $(\Lambda_A)^C$ ($A_B = \{7, 8, 9\}$ in Figure 2). $(A_B)^C$ is the largest set of components such that all of $(SD_{A_B})^C$ was used to compute $\mu_A$, and we again assume that $|(SD_{A_B})^C| \gg |\Lambda_A \setminus (SD_{A_B})^C|$. The evidence provided by $\mu_A$ suggests that some of the constraints in $(SD_{A_B})^C$ have been violated. Since adding constraints will not take away the fact that some of these have not been met, every superset of $(SD_{A_B})^C$ also contains broken constraints, indicating that every subset, $A_C$, of $A_B$ is an inverse conflict. As before, the probability that the set $A_C$ is an inverse conflict is:

$P(A_C \text{ is an inverse conflict} \mid \Lambda_A \wedge \mu_A = 0) = \frac{|\Lambda_A \cap (SD_{A_C})^C|}{|\Lambda_A|}$

We assume a mapping $\mathcal{PIC}(\mu_A) \in [0, 1]$, defined by the modeler, which represents the probability that the real artifact is inconsistent given $\mu_A$. This mapping is then used to compute a distribution, $B_{IC,A_C}(x; A)$, over the states {true, false, unknown} which represents the belief that the set $A_C$ is an inverse conflict given only the information from calling TP(·) on A.

$B_{IC,A_C}(true; A) = P(A_C \text{ is an inverse conflict} \mid \Lambda_A \wedge \mu_A = 0) \cdot \mathcal{PIC}(\mu_A)$
$B_{IC,A_C}(false; A) = (1 - P(A_C \text{ is an inverse conflict} \mid \Lambda_A \wedge \mu_A = 0)) \cdot \mathcal{PIC}(\mu_A)$
$B_{IC,A_C}(unknown; A) = 1 - \mathcal{PIC}(\mu_A)$

This belief distribution is incorporated into our current belief $B_{IC,A_C}(x)$ in the same manner as discussed in the previous section. The total probability theorem is again used, as in Equation 6, to compute the new belief distribution $B^+_{IC,A_C}(x)$ from the old one, $B_{IC,A_C}(x)$, and the new evidence, $B_{IC,A_C}(x; A)$, using the conditional probabilities in Table 1.

The new evidence provided by the call to $TP((SD_A)^C, OBS)$ has now been incorporated into the belief distributions $B_{IC,A}(x)$ and $B_{D,A}(x)$ for all subsets A of COMPS. The next section looks at how to use these belief distributions to choose the next set to pass to TP(·).

2.2  Next  best  set 

The order in which the subsets of COMPS are tested is crucial to the speed at which the algorithm will find the diagnoses. There are, however, several choices which will produce varying results, and so the choice depends largely on knowledge of the system. The following properties can be taken into account when developing a heuristic search strategy:

• Failure rates: choose sets of components with a history of failure.

• Expected knowledge gain: choose sets of components which are expected to reduce the unknown portions of the belief distributions the most (i.e. $B_{D,A}(unknown)$ and $B_{IC,A}(unknown)$). See [5] for a derivation.

• Current belief: choose the supersets and subsets of the set currently most likely to be a minimal diagnosis to isolate a single diagnosis as quickly as possible.

• Principle of Parsimony: choose the sets with the fewest components, as they are more likely to be diagnoses.

• Execution time: choose the sets with the most components, as TP(·) will likely take less time to evaluate systems with fewer constraints.

2.3  Stop  conditions 

The certainties in the potential diagnoses returned by the PGDE increase monotonically with each iteration [5]. Thus, the maximum certainties are achieved when all sets in P(COMPS) have been passed to TP(·) for testing. Since this is likely to take too long, a decision needs to be made about when to stop. As with the choice of a search strategy, this decision is mostly heuristic and entirely up to the modeler. Some examples of criteria are listed here:

• A time limit has been reached

• The total knowledge, summed over all sets in P(COMPS), has risen above some limit

• The knowledge gained per call to TP(·) has fallen below some level

• A percentage of the subsets of COMPS have been tested

• At least one minimal diagnosis has been found with some minimum certainty

2.4  Most  likely  minimal  diagnoses 

A minimal diagnosis is a diagnosis such that no proper subset of it is also a diagnosis. Minimal diagnoses are of interest as the Principle of Parsimony [7] states that the diagnoses with the fewest components are the most likely. The minimal diagnoses will have the properties that all of their supersets will be diagnoses and all of their proper subsets will be inverse conflicts. The goal is to determine which sets are most likely to have these properties given the belief distributions $B_{IC,A}(x)$ and $B_{D,A}(x)$.

2.4.1  Combining  BD(x)  and  BIC(x) 

The two belief distributions $B_D(x)$ and $B_{IC}(x)$ have been kept separate, as they represent different types of information. In order to compute the most likely minimal diagnoses, all of the information needs to be taken into account, and as a result they need to be combined. This is done using the conditional probability table shown as Table 2 to compute the combined belief distribution D(x). $D_A(true)$ represents the probability that A is a diagnosis, while $D_A(false)$ represents the probability that it is not. Note that this is different from $B_{D,A}(false)$, as $B_{D,A}(false)$ represents the belief that the evidence does not show that A is a diagnosis, whereas $D_A(false)$ represents the belief that the evidence does show that A is not a diagnosis. $D_A(unknown)$ represents the belief that we don't know what the evidence shows. The values in Table 2 are chosen such that if $B_{D,A}(x)$ and $B_{IC,A}(x)$ agree that A is a diagnosis and not an inverse conflict then $D_A(true) = 1$. However, if they do not agree, then we are confused about what the evidence has shown and $D_A(unknown) = 1$. If neither $B_{D,A}(x)$ nor $B_{IC,A}(x)$ has any information then $D_A(unknown) = 1$.

Table 2. Conditional Probability Table used to combine $B_D(x)$ and $B_{IC}(x)$ into D(x)

$P(D_A(x) \mid B_{D,A}(f_1) = 1 \wedge B_{IC,A}(f_2) = 1)$

2.4.2  Finding  the  minimal  diagnoses

Definition 7 below defines a distribution $DM_A(x)$ for each $A \in P(COMPS)$ which represents the belief that the set A has the properties of a minimal diagnosis.

Definition 7 Let $A \in P(COMPS)$.

Let $\{A_{C_i} \subset A,\ i = 1, \ldots, m,\ \forall i \neq j\ A_{C_i} \neq A_{C_j}\}$.

Let $\{A_{P_i} \supset A,\ i = 1, \ldots, n,\ \forall i \neq j\ A_{P_i} \neq A_{P_j}\}$.

Define the distribution $\neg D(x)$ such that:

$\neg D(true) = D(false)$
$\neg D(false) = D(true)$
$\neg D(unknown) = D(unknown)$

Define the operator $\odot$ such that $A \odot B$ equals the result of combining A and B using the conditional probability table, Table 3; then:

$DM_A(x) = D_A(x) \odot D_{A_{P_1}}(x) \odot \ldots \odot D_{A_{P_n}}(x) \odot \neg D_{A_{C_1}}(x) \odot \ldots \odot \neg D_{A_{C_m}}(x)$

Table 3. Conditional Probability Table used to compute $C = A \odot B$

$P(C(x) \mid A(f_1) = 1 \wedge B(f_2) = 1)$
The result is that $DM_A(x)$ is true for sets which have all the properties that a minimal diagnosis should have, and false or unknown for all other sets. Because DM(x) is a continuous distribution over the states {true, false, unknown}, a function is needed which allows the possible diagnoses to be returned to the diagnostician in order from most likely to least, along with a measure of the algorithm's certainty in the result. The following sorting function is suggested as a good balance between certainty in the result and the belief that the set is a minimal diagnosis:

$DM_A(true) \cdot (1 - DM_A(unknown))$   (8)

Minimal diagnoses can now be returned to the diagnostician in order from the one with the largest value for Equation 8 to the smallest. The probability that a set is a minimal diagnosis is equal to $DM_A(true) / (1 - DM_A(unknown))$ and the certainty in the result is defined by $1 - DM_A(unknown)$.
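
The ranking described above can be sketched in a few lines of Python. The input is assumed to be a dictionary from candidate sets to their minimal-diagnosis distributions; the component names and numbers are hypothetical, and returning a probability of 0 when nothing is known about a set is a choice made here to avoid division by zero, not something the text prescribes.

```python
from typing import Dict, FrozenSet, List, Tuple

Dist = Dict[str, float]
Ranked = Tuple[FrozenSet[str], float, float, float]


def rank_minimal_diagnoses(dm: Dict[FrozenSet[str], Dist]) -> List[Ranked]:
    """Sort candidates by Equation 8; each entry is (set, score, probability, certainty)."""
    ranked: List[Ranked] = []
    for comps, dist in dm.items():
        certainty = 1.0 - dist["unknown"]
        score = dist["true"] * certainty  # Equation 8
        probability = dist["true"] / certainty if certainty > 0.0 else 0.0
        ranked.append((comps, score, probability, certainty))
    ranked.sort(key=lambda entry: entry[1], reverse=True)
    return ranked


# Hypothetical distributions for three candidate sets.
dm = {
    frozenset({"filter"}): {"true": 0.72, "false": 0.24, "unknown": 0.04},
    frozenset({"valve"}): {"true": 0.50, "false": 0.00, "unknown": 0.50},
    frozenset({"pump"}): {"true": 0.00, "false": 0.00, "unknown": 1.00},
}
ranking = rank_minimal_diagnoses(dm)
for comps, score, probability, certainty in ranking:
    print(sorted(comps), round(score, 4), round(probability, 4), round(certainty, 4))
```

The second candidate has a higher conditional probability of being a minimal diagnosis than the first, but it ranks lower because half of its mass is still unknown; this is the balance between belief and certainty that Equation 8 is meant to strike.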

2.5  Complexity  considerations 

Calling TP(·) on every subset of COMPS is an exponential undertaking. If the PGDE is run so that the maximum certainty is achieved in the result, every subset of COMPS would need to be tested and the algorithm would indeed be exponential in time. However, a trade-off can be made between certainty and execution time by using some of the criteria listed in Section 2.3.

Maintaining the distributions $B_D(x)$ and $B_{IC}(x)$ is exponential in space if the entire set P(COMPS) is considered. However, it is reasonable to assume, for example, that the likelihood of 40 components failing simultaneously in a system of 50 components is negligible. Therefore, the algorithm does not require that the distributions $B_D(x)$ and $B_{IC}(x)$ cover all of P(COMPS), but only up to the level where a reasonable number of simultaneous faults are considered.
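
To give a feel for the saving, the number of sets that need a pair of belief distributions when at most k simultaneous faults are considered is the sum of the binomial coefficients up to k. The short Python sketch below computes it; the bound of three faults is an arbitrary illustrative choice, and the function name is hypothetical.

```python
from math import comb


def belief_table_size(n_components: int, max_faults: int) -> int:
    """Number of subsets of an n-component system with at most `max_faults` members."""
    return sum(comb(n_components, k) for k in range(max_faults + 1))


print(belief_table_size(50, 3))   # sets with at most three simultaneous faults
print(belief_table_size(50, 50))  # the full power set, 2**50
```

For 50 components, limiting attention to three simultaneous faults reduces the table from 2^50 entries to 20876.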

As seen in Figure 1, there are four steps to the algorithm which are performed in an iterative fashion: choose the next set, call TP(·), interpret the results, and update the beliefs $B_D(x)$ and $B_{IC}(x)$. This algorithm is primarily intended for the diagnosis of complex dynamic systems for which TP(·) will require a period of simulation in order to test for consistency, and so it is assumed that this call will take a significant period of time. Computing the next set to test can be a function of P(COMPS), but it is assumed that the call to TP(·) will take the majority of the time. Both the interpretation of the results and the updating of the belief states involve only the supersets and/or subsets of the set under test, which is a relatively small number when compared to the size of P(COMPS). The final two steps of the algorithm do involve the entire set P(COMPS), but as they are not part of the iterative procedure, their effect on the speed of the algorithm is not significant.

3  DIAGNOSIS  OF  A  HYDRAULIC  CIRCUIT 

Figure 4 shows a schematic for a single degree of freedom hydraulic manipulator used to test the algorithm presented in this paper. The model is made of eight components, as seen in Figure 3: the head-side port of the main valve, the rod-side port of the main valve, the cylinder, the manipulator, the rod-side anti-cavitation valve, the head-side anti-cavitation valve, the exit filter and the check valve. The behavior of the components is described by sets of hybrid dynamic equations which can be found in [6] and [5].

The function $TP((SD_A)^C, OBS)$ was implemented using a modified version of Hybrid Concurrent Constraint programming, or hcc [3]. The set of hybrid dynamic equations $(SD_A)^C$ is passed to the modified hcc, along with OBS, which are the time sequences of the sensor values. The system made of $(SD_A)^C$ and OBS will likely be over-constrained and the resulting simulation will contain several discrepancies between measured and simulated values. These residuals (simulated outputs less measured) will also be time sequences, which can be compared to a set of residuals recorded during normal operation to generate a consistency measure, $\mu_A$. During the experiments, the system was set up in a position control loop with a sinusoidal input signal at a frequency of 0.25 Hz. A period of six seconds was recorded, encompassing a single extension and retraction of the manipulator arm. Six experiments were run, each with the arm under a different failure condition which is common in a system such as this [6, 9]. The failures were caused by manual adjustment of the three valves and one friction plate shown in Figure 4. The faults are assumed to be permanent and to have occurred before the measurements are taken. At each iteration the set to be passed to TP(·) is selected to maximize the expected decrease in $U = \sum_{A \in P(COMPS)} \left( B_{IC,A}(unknown) + B_{D,A}(unknown) \right)$, and the algorithm is stopped when the change in U is less than 1% for more than 10 iterations.
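
The stopping rule used in the experiments can be sketched as a check over the recorded history of U. The text does not say whether the 1% refers to a relative or absolute change, so the sketch below assumes a relative change between consecutive iterations; the function name `should_stop` and its parameter names are hypothetical.

```python
from typing import Sequence


def should_stop(u_history: Sequence[float], rel_tol: float = 0.01, patience: int = 10) -> bool:
    """True when the relative change in U stayed below `rel_tol`
    for more than `patience` consecutive iterations at the end of the history."""
    run = 0  # length of the current streak of small changes
    for prev, curr in zip(u_history, u_history[1:]):
        small = prev == 0 or abs(curr - prev) / abs(prev) < rel_tol
        run = run + 1 if small else 0
    return run > patience


# Hypothetical history: two large drops followed by eleven small ones.
large_drops = [100.0, 80.0, 60.0]
small_drops = [60.0 - 0.1 * k for k in range(1, 12)]
print(should_stop(large_drops + small_drops))
print(should_stop(large_drops + small_drops[:-1]))
```

With eleven consecutive small changes the rule fires; with exactly ten it does not, matching the "more than 10 iterations" wording.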

The six failures and the results of fault isolation using the PGDE are as follows. On average, 99.90% of the time taken is spent in simulation during the calls to TP(·), while only 0.10% is required for the PGDE calculations. For details refer to [5].

Figure 3. Component model of the hydraulic test bench

• Leak in the hose connecting the valve to the head-side of the cylinder.

This failure was correctly isolated in all 10 sample runs, taking an average of 54.5 seconds.

• Leak in the hose connecting the valve to the rod-side of the cylinder.

This failure was correctly isolated in all 10 sample runs, taking an average of 53.1 seconds.

• Partially clogged return filter.

For two of the five tests run, the filter was returned as the most likely diagnosis, with the rod-side port of the main valve and the rod-side anti-cavitation valve together forming a close second. In the remaining three tests the filter was not returned as a diagnosis by itself, but five diagnoses containing the filter and another component were returned as all being very likely. The average calculation time was 167 seconds.

• Increased friction in manipulator bearing.

For two of the five tests run, the manipulator was returned as the only likely diagnosis with very high certainty (96%, 100%). In two more of the tests it was returned as one-half of a double fault, and in the fifth test the algorithm did not find the correct solution. These calculations took on average 82 seconds to complete.

• Leaks in both hoses connecting the valve to the cylinder.

In all five tests the four double faults {rod-side anti-cavitation valve, head-side anti-cavitation valve}, {rod-side anti-cavitation valve, head-side port}, {head-side anti-cavitation valve, rod-side port} and {head-side port, rod-side port} were returned as being equally likely with a high degree of certainty (≈ 85%). For this situation, these are the correct diagnoses, as explaining this failure requires one component on the rod-side and one on the head-side that can account for the leaks. The average calculation time was 140 seconds.

• Partially clogged return filter and a leak in the head-side hose.

In all five tests the algorithm returned the head-side anti-cavitation valve or port as the only explanation. The filter causes a much smaller effect on the system and so it is difficult to recognize it as faulty when other components are misbehaving. The average calculation time was 61 seconds.


Figure  4.  Schematic  of  experimental  test  bench 


4  CONCLUSIONS 

This paper has presented a novel approach to consistency-based diagnosis which allows for the use of any modelling language. The use of continuous distributions representing the belief that each set of components is a diagnosis allows the determination of consistency or inconsistency to be delayed until supporting evidence has been collected, and allows noise in the simulator, TP(·), to be handled. The demonstration of this algorithm on a non-trivial physical test bench shows that it can be applied effectively to isolate realistic faults in real artifacts.
Abstract. The paper presents an approach suitable for on-line diagnosis, which aims at automatically abstracting the domains of discrete variables in the model (i.e. behavioral modes of system components) in order to keep only those distinctions that are relevant given the available observations and their granularity.

In particular, the paper describes an algorithm which identifies indistinguishable behavioral modes by taking into account specific classes of available observations and derives an abstract model where such modes are merged and the domain model is revised accordingly.

By considering increasingly restricted classes of available observations (and/or granularity of observations), a set of abstract models can be derived that can be exploited through model selection each time a new diagnostic problem has to be solved.

The approach has been tested within the framework of a diagnostic agent for a space robotic arm, and experimental results showing the reduction in the number of diagnoses are reported.

1  Introduction 

Model-based diagnosis has been applied successfully to automatic on-board diagnosis problems in a variety of domains, including automotive and space missions ([1], [10]).

While many problems are common to off-line and on-line diagnosis, the latter presents some peculiar challenges, the most apparent of which concerns the tough constraints on computational resources and time ([3]).

Another difficult problem that both on-line and off-line diagnosis have to deal with is the potentially large number of alternative diagnoses returned by a diagnostic system when a specific problem has to be solved.

One classical way of addressing this problem consists of defining preference criteria among diagnoses, usually based on some form of minimality (see e.g. [6]) or probability, so that a number of admissible diagnoses can be discarded because of their implausibility.

We can also approach the problem not at diagnosis time, but at an earlier time (i.e. during system design and modeling): there exist guidelines for creating models suitable for troubleshooting (see e.g. [8]) as well as methods for suggesting the placement of enough sensors in the system to guarantee that only one or a few admissible diagnoses will be returned in each situation (see for example [15]); sensor failures can be handled by an adequate level of redundancy.

Finally, the encoding of large sets of diagnoses in a compact way can at least alleviate the explosion of time and space required to compute and handle such large sets (see [11]).

Unfortunately,  all  these  approaches  only  provide  a  partial 
solution;  while  preference  criteria,  cleverly  written  models  and 
compact  encoding  do  not  guarantee  that  the  reduced  set  of 
diagnoses  is  small  enough  in  all  situations,  exhaustive  sensor 
placement  may  be  too  expensive  or  just  impossible  because 
the  device  design  is  already  frozen. 

In off-line diagnosis, there is an additional possibility: when the number of diagnoses returned on the basis of available observations is too high, further discriminating measurements can be automatically suggested and manually taken until a satisfactory level of discrimination is reached. Effective techniques based on information theory and probability have been devised to support this process (e.g. [7]). However, for on-board diagnosis, this approach is inadequate since in most cases the only available measurements are provided by sensors and taking further measurements manually is out of the question.

In this paper we present an approach suitable for on-board diagnosis, which aims at automatically abstracting the domains of discrete variables in the model (i.e. behavioral modes of system components) in order to keep only those distinctions that are relevant given the available observations and their granularity. As we shall see, this can significantly reduce the number of returned diagnoses.

The paper is structured as follows. In section 2 we introduce some definitions, in particular the notion of indistinguishability among the behavioral modes of a component. In section 3 we present an algorithm which identifies indistinguishable behavioral modes by taking into account specific classes of available observations and derives an abstract model where such modes are merged and the domain model is revised accordingly. Section 4 discusses some ways the algorithm can be used effectively in diagnostic problem solving. In section 5 we report experimental results obtained by implementing and running the algorithm on the model for a space robotic arm. Finally, in section 6 we briefly review other approaches in the literature and underline similarities and differences with respect to our own.

2  Basic  Definitions 

First,  we  define  a  system  structure  description  (SSD)  by 
slightly  modifying  the  definition  in  [5]: 

Definition 2.1 A Structured System Description (SSD) is a tuple $\langle V, \mathcal{G}, DT \rangle$ where:

V is a set of variables whose domains $DOM(v), v \in V$ are discrete and finite. Moreover, variables in V are partitioned in the following sorts: CXT (inputs), COMPS (components), STATES (endogenous variables), OBS (observables) (see footnote 1).

DT (Domain Theory) is a set of Horn clauses defined over V representing the behavior of the system (both normal and faulty). Note that the clauses are constructed in such a way that the roles associated with variables belonging to different sorts are respected: CXT and COMPS variables will always appear in the body of clauses; OBS variables will always appear as heads of clauses; STATES variables can appear in both.

$\mathcal{G}$ (System Structure) is a DAG whose nodes are in V, representing the structure of the system. The graph can be directly computed from DT, being just a useful way of making explicit the structural properties "hidden" in DT clauses: whenever a formula $N_1(bm_1) \wedge \ldots \wedge N_k(bm_k) \Rightarrow M(bm_l)$ appears in DT, nodes $N_1$ through $N_k$ are parents of M in the graph.

Since the system structure graph $\mathcal{G}$ is a DAG, a partial precedence relation holds between connected nodes in the graph:

Definition 2.2 We denote with $\succ$ the usual precedence partial order relation over nodes in DAG $\mathcal{G}$, i.e.: $N \succ M$ if there exists a directed path from N to M.

Given  an  SSD  we  can  define  specific  diagnostic  problems  over 
it: 

Definition 2.3 A diagnostic problem is a tuple $DP = \langle SSD, OBS', CXT \rangle$ where SSD is the Structured System Description, OBS' is an instantiation of a subset of the observables ($OBS' \subseteq OBS$) and CXT is a complete instantiation of the inputs CXT.

We are now ready to give our definition of diagnosis, which is a fully abductive characterization (see footnote 2 and [4]):

Definition 2.4 Given a diagnostic problem $DP = \langle SSD, OBS', CXT \rangle$, an assignment $H = \{c_1(bm_1), \ldots, c_n(bm_n)\}$ of a behavioral mode to each component $c_i \in COMPS$ is a diagnosis for DP if and only if:

$\forall m(x) \in OBS' : DT \cup CXT \cup H \vdash m(x)$

and

$\forall m(x) \in OBS' : DT \cup CXT \cup H \nvdash m(y)$ for $y \neq x$
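As an illustration of Definition 2.4, the following minimal sketch checks whether a mode assignment is an abductive diagnosis by forward chaining over Horn clauses. The clause encoding (a pair of body atoms and a head atom) and the valve/sensor model are assumptions made for this example only; they do not come from the paper.

```python
def closure(clauses, facts):
    """Forward-chain Horn clauses (body, head) from the given atoms to a fixpoint."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in clauses:
            if head not in derived and body <= derived:
                derived.add(head)
                changed = True
    return derived


def is_diagnosis(clauses, cxt, hypothesis, obs):
    """Definition 2.4: each observed m(x) is entailed, and no m(y) with y != x is."""
    derived = closure(clauses, set(cxt) | set(hypothesis))
    for var, val in obs:
        if (var, val) not in derived:
            return False
        if any(v == var and w != val for v, w in derived):
            return False
    return True


# Hypothetical toy model: a valve feeding a flow sensor.
DT = [
    (frozenset({("inlet", "on"), ("valve", "ok")}), ("flow", "high")),
    (frozenset({("inlet", "on"), ("valve", "stuck")}), ("flow", "none")),
    (frozenset({("flow", "high")}), ("sensor", "high")),
    (frozenset({("flow", "none")}), ("sensor", "low")),
]
cxt = {("inlet", "on")}
print(is_diagnosis(DT, cxt, {("valve", "stuck")}, {("sensor", "low")}))  # True
print(is_diagnosis(DT, cxt, {("valve", "ok")}, {("sensor", "low")}))     # False
```

The sketch enumerates nothing: it only verifies one candidate assignment, which is the check a diagnostic engine would apply to each hypothesis it generates.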

1  We  assume  that  observables  never  influence  other  variables.  This 
is  not  restrictive:  each  observable  parameter  which  influences 
other  variables  is  modeled  as  an  endogenous  variable  (i.e.  it  be¬ 
longs  to  STATES)  with  an  associated  observable  in  OBS 

2  Note  however  that  our  approach  does  not  depend  on  the  definition 
of  diagnosis  being  abductive  vs  consistency-based 


Since  in  our  definition,  OBS'  is  (in  general)  a  partial  instan¬ 
tiation  of  OBS,  we  can  introduce  the  notion  of  diagnoses  that 
can’t  be  discriminated  given  OBS'  but  that  may  be  discrim¬ 
inated  if  more  observables  were  available: 

Definition 2.5 Given a diagnostic problem $DP = \langle SSD, OBS', CXT \rangle$, let us suppose that $H_1$ and $H_2$ are two diagnoses for DP. $H_1$ and $H_2$ are discriminable if and only if $\exists m \mid m \in (OBS - OBS')$ such that

$DT \cup CXT \cup H_1 \vdash m(a)$
$DT \cup CXT \cup H_2 \vdash m(b)$
$m(a) \neq m(b)$

Diagnoses are complete instantiations of variables in sort COMPS. We now turn to considering two such assignments $A_1$ and $A_2$ and compute the projections (see footnote 3) of their transitive closures (see footnote 4) over OBS ($OBS_1 = project_{OBS}(tclosure(A_1))$ and $OBS_2 = project_{OBS}(tclosure(A_2))$ respectively), given a fixed context CXT.

If $OBS_1 = OBS_2$ then $A_1$ and $A_2$ are indiscriminable diagnoses for diagnostic problem $\langle SSD, OBS, CXT \rangle$ where $OBS = OBS_1$ (and $= OBS_2$). An interesting relation between $A_1$ and $A_2$ holds when this situation happens under any fixed context CXT:

Definition 2.6 Let $A_1$ and $A_2$ be two complete instantiations of COMPS; if, given any context CXT, $project_{OBS}(tclosure(A_1)) = project_{OBS}(tclosure(A_2))$, then we say that $A_1$ and $A_2$ are indiscriminable.

In the above definition we have considered the case where all OBS are available. Let's now consider the case (as is usual in on-board diagnosis) when we can identify subsets of OBS that may be the only available manifestations (e.g. only sensorized manifestations may be available on-board, with no possibility to perform further measurements).

Let $\{CL_k\}$ denote such identified interesting subsets (not necessarily all disjoint); we can now refine definition 2.6 as follows:

Definition 2.7 Two assignments $A_1$ and $A_2$ are $CL_k$-indiscriminable iff they are indiscriminable by considering OBS restricted to $CL_k$, i.e. $\forall CXT : project_{CL_k}(tclosure(A_1)) = project_{CL_k}(tclosure(A_2))$

Given the above definitions, we are now ready to characterize formally two behavioral modes (i.e. values from the domain of a component variable $c_i$) that may be safely collapsed together without losing any discriminability power of the model:

Definition 2.8 Let $bm_r$ and $bm_s$ be two behavioral modes of component variable $c_i$; if for any two assignments $A_1 = (\alpha_1 \wedge c_i(bm_r) \wedge \alpha_2)$ and $A_2 = (\alpha_1 \wedge c_i(bm_s) \wedge \alpha_2)$ such that they differ only in the mode associated to $c_i$, $A_1$ and $A_2$ are ($CL_k$-)indiscriminable, then we say that $bm_r$ and $bm_s$ are ($CL_k$-)indistinguishable.


3 A projection of a set of instantiated variables I over a set of variables W ($project_W(I)$) is just the subset of I that mentions variables in W

4 The transitive closure of $A_i$ ($tclosure(A_i)$) is the set of $m(x)$ s.t. $DT \cup CXT \cup A_i \vdash m(x)$
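Definition 2.8 can be checked directly, if inefficiently, by enumerating every context and every assignment to the other components. The sketch below does exactly that; the valve model, the variable names and the `free_vars` parameter (the context variables plus all components other than the one under test) are assumptions introduced for illustration, and the abstraction algorithm of section 3 avoids this exhaustive enumeration.

```python
from itertools import product


def closure(clauses, facts):
    """Forward-chain Horn clauses (body, head) to a fixpoint."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in clauses:
            if head not in derived and body <= derived:
                derived.add(head)
                changed = True
    return derived


def indistinguishable(clauses, domains, free_vars, comp, mode_r, mode_s, cl_k):
    """Brute-force Definition 2.8: the two modes of `comp` give the same values
    on the manifestations in cl_k for every assignment to free_vars."""
    for values in product(*(domains[v] for v in free_vars)):
        rest = set(zip(free_vars, values))
        seen_r = {a for a in closure(clauses, rest | {(comp, mode_r)}) if a[0] in cl_k}
        seen_s = {a for a in closure(clauses, rest | {(comp, mode_s)}) if a[0] in cl_k}
        if seen_r != seen_s:
            return False
    return True


# Hypothetical model: "noise" can only be observed manually, "sensor" is on-board.
domains = {"inlet": ["on", "off"], "valve": ["ok", "stuck", "clogged"]}
DT = [
    (frozenset({("inlet", "on"), ("valve", "ok")}), ("sensor", "high")),
    (frozenset({("inlet", "on"), ("valve", "stuck")}), ("sensor", "low")),
    (frozenset({("inlet", "on"), ("valve", "clogged")}), ("sensor", "low")),
    (frozenset({("inlet", "on"), ("valve", "stuck")}), ("noise", "quiet")),
    (frozenset({("inlet", "on"), ("valve", "clogged")}), ("noise", "hiss")),
]
DT += [(frozenset({("inlet", "off"), ("valve", m)}), ("sensor", "low"))
       for m in domains["valve"]]

print(indistinguishable(DT, domains, ["inlet"], "valve", "stuck", "clogged", {"sensor"}))
print(indistinguishable(DT, domains, ["inlet"], "valve", "stuck", "clogged", {"sensor", "noise"}))
```

With only the sensor available the two faulty modes collapse; once the manual observation is added to the class they become distinguishable again.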


3  Automatic  Domain  Abstraction 
3.1  The  Algorithm 

In this section we present an algorithm which identifies indistinguishable modes in a given model (that we will refer to as the detailed model), and generates a simplified model (that we will call abstract) where mutually indistinguishable modes are merged into new modes. The algorithm assumes that the model is defined as in definition 2.1 and further assumes that in the system structure graph $\mathcal{G}$ at most one directed path exists between any two nodes.

The top level function Abstract() is sketched as pseudo-code in figure 1 while other relevant functions called by Abstract() are shown in figure 2.

Parameter $CL_k \subseteq Obs$ of Abstract() contains the list of available manifestations, while $\Pi_{CL_k}$ associates to each $M \in CL_k$ its granularity in the form of a partition $\Pi_M$ over $DOM(M)$.

Manifestations that aren't available at all do not belong to $CL_k$. If M is available at a certain level of granularity, $\Pi_M$ will contain as many classes as the distinguishable values for M, and each class will contain all the $v \in DOM(M)$ that can't be distinguished at the available level of granularity. As a special case, if M is available at its maximum granularity, $\Pi_M$ will contain a separate class for each $v \in DOM(M)$.

The first few instructions of Abstract() perform an initial abstraction of the model based on $\Pi_{CL_k}$: indistinguishable values for each manifestation M (i.e. those that belong to the same class in $\Pi_M$) are substituted in DT by a new "abstract" value representing the whole class.

The call to TopologicalSort() returns a list containing variables in $Comps \cup States$ such that if two variables N, M satisfy relation 2.2 (i.e. $N \succ M$) we guarantee that position(N) > position(M). In particular, we start a visit of the system structure graph $\mathcal{G}$ at the available observation nodes and proceed backwards by visiting a node only if all its immediate successors have already been visited.

Note that, by starting the visit at the available manifestations only (i.e. $CL_k$), some of the Comps and States may not be reached at all; these nodes, which are connected only to unavailable manifestations, are stored in a TrivialNodes list (see below).
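The backward visit just described can be sketched as follows. This is one possible reading under stated assumptions: the graph is given as a hypothetical `children` dictionary, successors that cannot reach any available manifestation are ignored when deciding whether a node is ready (mirroring the intersection with Candidates and $CL_k$ in figure 1), and ties are broken alphabetically. The nodes `c4`, `s3` and `m3` are invented to show a trivial branch.

```python
def topological_sort(candidates, cl_k, children):
    """Order the nodes that can reach an available manifestation so that every
    node comes after all of its relevant successors; return the unreachable
    nodes separately as trivial nodes."""
    memo = {}

    def relevant(n):
        # A node is relevant if it is an available manifestation or leads to one.
        if n in cl_k:
            return True
        if n not in memo:
            memo[n] = any(relevant(c) for c in children.get(n, ()))
        return memo[n]

    pending = sorted(n for n in candidates if relevant(n))
    visited, order = set(cl_k), []
    while pending:
        for n in pending:
            successors = [c for c in children.get(n, ()) if relevant(c)]
            if all(c in visited for c in successors):
                order.append(n)
                visited.add(n)
                pending.remove(n)
                break
        else:
            raise ValueError("the structure graph is not acyclic")
    trivial = sorted(set(candidates) - set(order))
    return order, trivial


children = {
    "c1": ["s1"], "c2": ["s1"], "c3": ["s2"], "c4": ["s3"],
    "s1": ["m1", "m2"], "s2": ["m1", "m2"], "s3": ["m3"],
}
candidates = ["c1", "c2", "c3", "c4", "s1", "s2", "s3"]
order, trivial = topological_sort(candidates, {"m1", "m2"}, children)
print(order)    # ['s1', 'c1', 'c2', 's2', 'c3']
print(trivial)  # ['c4', 's3']
```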

The main loop in Abstract(), for each variable N in the list, first computes the conditions under which the variable influences the modes of its immediate successors (this is recorded in an associative memory InfluenceMatrix[]); then, by using InfluenceMatrix[], it computes the partition of all the modes of the variable into equivalence classes determined by the indistinguishability relation (FindIndistinguishableModes()); finally it replaces the occurrences of the modes in the DT clauses with newly introduced "class representative" modes (MergeModes()).

If the call to FindIndistinguishableModes() produced a trivial partition for N (i.e. only one class coinciding with DOM(N)) then N itself is added to the list of trivial nodes.

When Abstract() terminates, TrivialNodes contains the components and states whose behavioral modes are all equivalent in influencing relevant manifestations (i.e. $M \in CL_k$). These nodes, together with unavailable manifestations (i.e. $M \in Obs \setminus CL_k$), are obviously redundant for the diagnostic task and the caller of Abstract() may decide to completely remove them from the model.

Let's now describe in some more detail the functions called by Abstract() (figure 2).

Function FindInfluences() considers how each mode $bm_r$ of variable N under consideration can cause mode $bm_s$ of immediate successor variable M. The condition under which $N(bm_r)$ causes $M(bm_s)$ is clearly the disjunction of conjunctions of the form $\alpha = \alpha_1 \wedge \alpha_2$ where $\alpha_1$ and $\alpha_2$ occur in a formula $\alpha_1 \wedge N(bm_r) \wedge \alpha_2 \Rightarrow M(bm_s)$.
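A minimal sketch of this computation is given below. It assumes a clause is encoded as a pair (frozenset of body atoms, head atom) and represents each DNF condition as a set of conjunct sets, so that the empty set stands for "false"; this encoding is an illustrative choice and not the paper's data structure. The clauses are the $m2$ clauses of the example in section 3.3.

```python
def find_influences(clauses, node, successors, domains):
    """For every mode bm_r of `node` and every value bm_s of each successor M,
    collect the DNF condition under which node(bm_r) causes M(bm_s).
    A condition is a frozenset of conjuncts; a conjunct is a frozenset of atoms."""
    matrix = {}
    for bm_r in domains[node]:
        for m in successors:
            for bm_s in domains[m]:
                matrix[(node, bm_r), (m, bm_s)] = frozenset(
                    body - {(node, bm_r)}
                    for body, head in clauses
                    if head == (m, bm_s) and (node, bm_r) in body
                )
    return matrix


domains = {"s1": "ab", "s2": "abc", "m2": "xyz"}
# s1(a) yields m2(x), s1(b) yields m2(y), unless s2(c) holds, which yields m2(z).
clauses = [
    (frozenset({("s1", s1), ("s2", s2)}), ("m2", "z" if s2 == "c" else out))
    for s1, out in (("a", "x"), ("b", "y"))
    for s2 in "abc"
]
matrix = find_influences(clauses, "s2", ["m2"], domains)
print(matrix[("s2", "a"), ("m2", "x")])  # the single conjunct s1(a)
print(matrix[("s2", "a"), ("m2", "z")])  # frozenset(), i.e. false
```

Storing conditions as sets of sets makes two syntactically reordered formulas compare equal, which is a cheap stand-in for the normalization step mentioned in the text; it does not detect every logical equivalence.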

Function FindIndistinguishableModes() is recursive; at each call it partitions a set of modes into indistinguishability classes based on a single immediate successor node and then calls itself recursively on each of the generated equivalence classes in order to further discriminate by considering the remaining immediate successors.

Note that in the test ($\langle \alpha, \pi \rangle \in \Pi_{cond}$) we are testing propositional formulas for identity; we assume that any two equivalent formulas have been made identical at that point by calls to normalize() in FindInfluences(). Normalization is not too computationally expensive since the formulas we handle are in DNF and only positive literals can occur.
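The recursive refinement can be sketched as below, under the same assumed clause encoding (body atoms as a frozenset, head atom as a pair). To keep the example self-contained the influence conditions are recomputed on demand by a small hypothetical helper instead of being read from a precomputed matrix. The clauses are the $m1$ and $m2$ clauses of the example in section 3.3.

```python
def influence(clauses, atom, head):
    """DNF condition (set of conjunct sets) under which `atom` causes `head`."""
    return frozenset(body - {atom} for body, h in clauses if h == head and atom in body)


def find_indistinguishable_modes(clauses, domains, node, modes, successors):
    """Partition `modes` by their effect on the first successor, then refine
    each class recursively on the remaining successors."""
    m = successors[0]
    classes = {}
    for bm_r in sorted(modes):
        profile = tuple(influence(clauses, (node, bm_r), (m, bm_s)) for bm_s in domains[m])
        classes.setdefault(profile, []).append(bm_r)
    if len(successors) == 1:
        return [frozenset(c) for c in classes.values()]
    return [part
            for c in classes.values()
            for part in find_indistinguishable_modes(clauses, domains, node, c, successors[1:])]


domains = {"s1": "ab", "s2": "abc", "m1": "xy", "m2": "xyz"}
clauses = []
for s1, out in (("a", "x"), ("b", "y")):
    for s2 in "abc":
        body = frozenset({("s1", s1), ("s2", s2)})
        clauses.append((body, ("m1", out)))
        clauses.append((body, ("m2", "z" if s2 == "c" else out)))

print(find_indistinguishable_modes(clauses, domains, "s2", "abc", ["m1", "m2"]))
print(find_indistinguishable_modes(clauses, domains, "s1", "ab", ["m1", "m2"]))
```

On this input the modes a and b of s2 end up in one class and c in another, while the two modes of s1 stay separate.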

Function MergeModes(), given a partition $\Pi$ (either an element of $\Pi_{CL_k}$ or computed by FindIndistinguishableModes()), considers the equivalence classes $\pi$ one at a time. It generates a new name $\nu$ as a "representative" for the class and then scans the DT set of formulas for occurrences of $bm \in \pi$ and replaces them with $\nu$. This process can produce duplicate formulas (see the footnote on the collapse of formulas); by using set notation in the pseudo-code we underline that only one copy of the duplicate formulas has to be added to the new version of DT.
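The renaming step can be sketched as follows, again assuming clauses encoded as (frozenset of body atoms, head atom). Two choices here are assumptions of the sketch rather than of the paper: the representative name is the concatenation of the merged mode names, and clauses that do not mention the node are carried over unchanged. Building the result as a Python set removes duplicate clauses automatically.

```python
def merge_modes(clauses, node, partition):
    """Replace every mode of `node` by the representative of its class and
    return the resulting clause set (duplicates collapse)."""
    rename = {bm: "".join(sorted(cls)) for cls in partition for bm in cls}

    def substitute(atom):
        var, val = atom
        return (var, rename[val]) if var == node else atom

    return {(frozenset(substitute(a) for a in body), substitute(head))
            for body, head in clauses}


# The six m1 clauses of the example in section 3.3.
clauses = [
    (frozenset({("s1", s1), ("s2", s2)}), ("m1", out))
    for s1, out in (("a", "x"), ("b", "y"))
    for s2 in "abc"
]
merged = merge_modes(clauses, "s2", [frozenset("ab"), frozenset("c")])
print(len(clauses), "->", len(merged))  # 6 -> 4
```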

3.2  Correctness 

In  this  paragraph  we  state  two  properties  which  imply  that 
the  abstraction  algorithm  behaves  as  intended. 

Property 3.1 If two behavioral modes are put in the same class $\pi$ by function FindIndistinguishableModes() they are $CL_k$-indistinguishable in the sense of definition 2.8.

Proof. Given assignments $A_1 = \alpha_1 \cup \{N(bm_{r1})\} \cup \alpha_2$ and $A_2 = \alpha_1 \cup \{N(bm_{r2})\} \cup \alpha_2$, suppose $DT \cup CXT \cup A_1 \vdash m(x)$ while $DT \cup CXT \cup A_2 \nvdash m(x)$ for some $m \in CL_k$. Clearly, it can't be $m(x) \in tclosure(\alpha_1 \cup \alpha_2)$ because otherwise $m(x)$ would be derivable from $A_2$ as well.

Then, the entailment of $m(x)$ by $A_1$ must exploit at some point $N(bm_{r1})$ by using a formula $\varphi = (N(bm_{r1}) \wedge \gamma \Rightarrow L)$. If $L = m(x)$, i.e. the formula directly entails $m(x)$, then an analogous formula $\varphi' = (N(bm_{r2}) \wedge \gamma' \Rightarrow m(x))$ must exist in DT, with $\gamma \equiv \gamma'$ (indeed, two modes are put in the same partition only if they have the same direct effects under the same conditions). Then, $DT \cup CXT \cup A_2 \vdash m(x)$, which is a contradiction.

This result can be extended to the case when $L \neq m(x)$ (i.e. the number of steps between the application of formula $\varphi$ and the conclusion $m(x)$ is greater than 1) with a proof by induction. □


5 This is not incidental: the value of our abstraction partially lies in the collapse of formulas


Function Abstract($V = \{Cxt, Comps, States, Obs\}$, $\mathcal{G}$, DT, $CL_k$, $\Pi_{CL_k}$)
    ForEach $M \in CL_k$
        DT := MergeModes(M, $\Pi_{CL_k}(M)$, DT)
    Loop
    Candidates := TopologicalSort($States \cup Comps$, $CL_k$, $\mathcal{G}$)
    TrivialNodes := $States \cup Comps \setminus Candidates$
    InfluenceMatrix := $\emptyset \times \emptyset \times \emptyset$
    ForEach ($N \in Candidates$)
        ImmediateSuccessors := {children of N in the system structure graph $\mathcal{G}$} $\cap$ ($Candidates \cup CL_k$)
        InfluenceMatrix := InfluenceMatrix $\cup$ FindInfluences(N, ImmediateSuccessors)
        $\Pi$ := FindIndistinguishableModes(N, modes(N), ImmediateSuccessors, InfluenceMatrix)
        If ($\Pi = \{DOM(N)\}$) Then TrivialNodes := TrivialNodes $\cup \{N\}$
        DT := MergeModes(N, $\Pi$, DT)
    Loop
    Return
EndFunction

Figure 1. Sketch of the Abstract() function


The  following  property  is  intended  to  demonstrate  the 
correspondence  of  a  diagnosis  at  the  abstract  level  to  a  set  of 
diagnoses  at  the  detailed  level. 

Property 3.2 Let $DP_d = \langle SSD_d, OBS', CXT \rangle$ be a diagnostic problem and $DP_a = \langle SSD_a, OBS', CXT \rangle$ the corresponding problem at the abstract level. Then, $D_a = \{c_1(\nu_1), \ldots, c_n(\nu_n)\}$, where $\nu_i$ is a new behavioral mode introduced in place of set $\{bm_{i1}, \ldots, bm_{ik_i}\}$ of indistinguishable behavioral modes, is a diagnosis for $DP_a$ iff all the elements in the set:

$\{\{c_1(bm_{1j_1}), \ldots, c_n(bm_{nj_n})\},\ j_i = 1 \ldots k_i\}$

are diagnoses for $DP_d$.

Proof. Our proof is subdivided in 3 steps: first, we prove that for any two diagnoses at the detailed level $D_{d1}$ and $D_{d2}$, $project_{OBS'}(tclosure(D_{d1})) = project_{OBS'}(tclosure(D_{d2}))$, where $OBS' \subseteq OBS$ represents the available manifestations (parameter Obs of function Abstract()). Then, we prove that for any detailed diagnosis $D_d$, $project_{OBS'}(tclosure(D_d)) = project_{OBS'}(tclosure(D_a))$. Finally, we exploit this result to prove the property.

In the following, $project_{OBS'}(tclosure(\cdot))$ is abbreviated as $tc_{OBS'}(\cdot)$.

For step 1, we proceed by induction on the number of components which are assigned different behavioral modes in assignments $D_{d1}$ and $D_{d2}$. The case n = 1 (i.e. $D_{d1} = \alpha \cup \{c_i(bm_r)\}$ and $D_{d2} = \alpha \cup \{c_i(bm_s)\}$) follows from the definition of indistinguishability of $bm_r$ and $bm_s$.

For the inductive step, where $D_{d1} = \alpha_1 \cup \{c_i(bm_r)\}$, $D_{d2} = \alpha_2 \cup \{c_i(bm_s)\}$ and $\alpha_1, \alpha_2$ differ in assignments to n components, we note that the following relations hold:

$tc_{OBS'}(\alpha_1 \cup c_i(bm_r)) = tc_{OBS'}(\alpha_1 \cup c_i(bm_s))$

from indistinguishability of $bm_r$ and $bm_s$, and:

$tc_{OBS'}(\alpha_1 \cup c_i(bm_s)) = tc_{OBS'}(\alpha_2 \cup c_i(bm_s))$

from the inductive hypothesis. It then follows that $tc_{OBS'}(\alpha_1 \cup c_i(bm_r)) = tc_{OBS'}(\alpha_2 \cup c_i(bm_s))$.

In order to carry out step 2, we note that, since $\nu_i$ is substituted by MergeModes() wherever a mode $bm_{ij_i}$ of its associated class $\pi$ appears, the following holds:

$tc_{OBS'}(D_a) = \bigcup_{j_1, \ldots, j_n} tc_{OBS'}(\{c_1(bm_{1j_1}), \ldots, c_n(bm_{nj_n})\})$

But, in step 1, we have proved that all the terms of the union are equal. So, $tc_{OBS'}(D_a)$ is equal to the $tc_{OBS'}$ of any of the $D_d$.

We use this result for step 3: $D_a$ is a diagnosis with the abstracted model iff $DT \cup CXT \cup D_a \vdash OBS'$; but, then, for any $D_d$ the same entailment must hold, thus any $D_d$ is a diagnosis at the detailed level. The converse is analogous. □
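Property 3.2 implies that one abstract diagnosis stands for a Cartesian product of detailed diagnoses. The short sketch below expands an abstract diagnosis into that set; the dictionary layout and the function name are assumptions for illustration, and the component and mode names follow the example of section 3.3.

```python
from itertools import product


def expand(abstract_diagnosis, classes):
    """List the detailed diagnoses represented by one abstract diagnosis.
    classes[c][v] is the set of detailed modes merged into abstract mode v of c."""
    comps = sorted(abstract_diagnosis)
    options = [sorted(classes[c][abstract_diagnosis[c]]) for c in comps]
    return [dict(zip(comps, choice)) for choice in product(*options)]


classes = {
    "c1": {"ab": {"a", "b"}},
    "c2": {"ab": {"a", "b"}, "c": {"c"}},
    "c3": {"a": {"a"}, "bc": {"b", "c"}},
}
detailed = expand({"c1": "ab", "c2": "c", "c3": "bc"}, classes)
for d in detailed:
    print(d)
print(len(detailed))  # 4 detailed diagnoses behind a single abstract one
```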


3.3  An  Example 

We  end  this  section  by  illustrating  how  the  abstraction  algo¬ 
rithm  works  on  a  very  simple  SSD.  Let  the  original  Domain 
Theory  DT  contain  the  following  clauses  (figure  3  shows  the 
associated  System  Structure  Graph): 


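$s1(a) \wedge s2(a) \Rightarrow m1(x)$
$s1(a) \wedge s2(b) \Rightarrow m1(x)$
$s1(a) \wedge s2(c) \Rightarrow m1(x)$
$s1(b) \wedge s2(a) \Rightarrow m1(y)$
$s1(b) \wedge s2(b) \Rightarrow m1(y)$
$s1(b) \wedge s2(c) \Rightarrow m1(y)$

$s1(a) \wedge s2(a) \Rightarrow m2(x)$
$s1(a) \wedge s2(b) \Rightarrow m2(x)$
$s1(a) \wedge s2(c) \Rightarrow m2(z)$
$s1(b) \wedge s2(a) \Rightarrow m2(y)$
$s1(b) \wedge s2(b) \Rightarrow m2(y)$
$s1(b) \wedge s2(c) \Rightarrow m2(z)$

$i1(a) \wedge c1(a) \wedge c2(a) \Rightarrow s1(a)$
$i1(a) \wedge c1(a) \wedge c2(b) \Rightarrow s1(a)$
$i1(a) \wedge c1(a) \wedge c2(c) \Rightarrow s1(b)$
$i1(a) \wedge c1(b) \wedge c2(a) \Rightarrow s1(a)$
$i1(a) \wedge c1(b) \wedge c2(b) \Rightarrow s1(a)$
$i1(a) \wedge c1(b) \wedge c2(c) \Rightarrow s1(b)$
$i1(b) \wedge c1(a) \wedge c2(a) \Rightarrow s1(b)$
$i1(b) \wedge c1(a) \wedge c2(b) \Rightarrow s1(b)$
$i1(b) \wedge c1(a) \wedge c2(c) \Rightarrow s1(a)$
$i1(b) \wedge c1(b) \wedge c2(a) \Rightarrow s1(b)$
$i1(b) \wedge c1(b) \wedge c2(b) \Rightarrow s1(b)$
$i1(b) \wedge c1(b) \wedge c2(c) \Rightarrow s1(a)$

$i2(a) \wedge c3(a) \Rightarrow s2(c)$
$i2(a) \wedge c3(b) \Rightarrow s2(a)$
$i2(a) \wedge c3(c) \Rightarrow s2(b)$
$i2(b) \wedge c3(a) \Rightarrow s2(c)$
$i2(b) \wedge c3(b) \Rightarrow s2(a)$
$i2(b) \wedge c3(c) \Rightarrow s2(b)$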

Function FindInfluences(N, ImmediateSuccessors)
    NodeInfluenceMatrix := $\emptyset \times \emptyset \times \emptyset$
    ForEach ($bm_r \in modes(N)$, $M \in ImmediateSuccessors$, $bm_s \in modes(M)$)
        Formulas := {clauses where $N(bm_r)$ occurs in the body and $M(bm_s)$ occurs in the head}
        $\alpha$ := false
        ForEach ($(\alpha_1 \wedge N(bm_r) \wedge \alpha_2 \Rightarrow M(bm_s)) \in Formulas$)
            $\alpha$ := $\alpha \vee (\alpha_1 \wedge \alpha_2)$
        Loop
        NodeInfluenceMatrix := NodeInfluenceMatrix $\cup \{\langle N(bm_r), M(bm_s), normalize(\alpha) \rangle\}$
    Loop
    Return NodeInfluenceMatrix
EndFunction

Function FindIndistinguishableModes(N, Modes, Nodes, InfluenceMatrix)
    M := first(Nodes)
    $\Pi_{cond}$ := $\emptyset$
    ForEach ($bm_r \in Modes$)
        $\alpha$ := $\bigcup_{bm_s} \langle InfluenceMatrix(N(bm_r), M(bm_s)), M(bm_s) \rangle$
        If ($\langle \alpha, \pi \rangle \in \Pi_{cond}$) Then
            $\Pi_{cond}$ := $\Pi_{cond} - \{\langle \alpha, \pi \rangle\} \cup \{\langle \alpha, \pi \cup \{bm_r\} \rangle\}$
        Else
            $\Pi_{cond}$ := $\Pi_{cond} \cup \{\langle \alpha, \{bm_r\} \rangle\}$
        EndIf
    Loop
    $\Pi$ := $\bigcup_{\langle \alpha, \pi \rangle \in \Pi_{cond}} \{\pi\}$
    If ($tail(Nodes) \neq \emptyset$)
        ForEach ($\pi \in \Pi$)
            $\Pi$ := $\Pi - \pi \cup$ FindIndistinguishableModes(N, $\pi$, tail(Nodes), InfluenceMatrix)
        Loop
    EndIf
    Return $\Pi$
EndFunction

Function MergeModes(N, $\Pi$, DT)
    DT' := $\emptyset$
    ForEach ($\pi \in \Pi$)
        $\nu$ := GenerateNewModeName($\pi$)
        Formulas := {clauses for which $\exists bm_r \in \pi$ s.t. $N(bm_r)$ appears in the body or head}
        ForEach ($(\varphi = \alpha_1 \wedge N(bm_r) \wedge \alpha_2 \Rightarrow M(bm_s)) \in Formulas$)
            DT' := DT' $\cup \{(\alpha_1 \wedge N(\nu) \wedge \alpha_2 \Rightarrow M(bm_s))\}$
        Loop
        ForEach ($(\psi = \alpha \Rightarrow N(bm_r)) \in Formulas$)
            DT' := DT' $\cup \{(\alpha \Rightarrow N(\nu))\}$
        Loop
    Loop
    Return DT'
EndFunction

Figure 2. Sketch of the main functions called by Abstract()




with $OBS = \{m1, m2\}$, $STATES = \{s1, s2\}$, $COMPS = \{c1, c2, c3\}$ and $CXT = \{i1, i2\}$.

Let the domains of the variables be as follows:

$DOM(m1) = \{x, y\}$, $DOM(m2) = \{x, y, z\}$
$DOM(s1) = \{a, b\}$, $DOM(s2) = \{a, b, c\}$
$DOM(c1) = \{a, b\}$, $DOM(c2) = DOM(c3) = \{a, b, c\}$
$DOM(i1) = \{a, b\}$, $DOM(i2) = \{a, b\}$

Furthermore,  let’s  assume  for  simplicity  that  all  the  OBS  are 
available  at  their  maximum  granularity. 

The algorithm starts by trying to merge modes of s1. The InfluenceMatrix[] entries relating s1 to m1 are:

$\langle s1(a), \{\langle m1(x), s2(a) \vee s2(b) \vee s2(c) \rangle, \langle m1(y), \bot \rangle\} \rangle$
$\langle s1(b), \{\langle m1(x), \bot \rangle, \langle m1(y), s2(a) \vee s2(b) \vee s2(c) \rangle\} \rangle$

It follows that modes a, b of s1 can't be merged. It is now the turn of s2 to be considered; the entries relating s2 to m1 are:

$\langle s2(a), \{\langle m1(x), s1(a) \rangle, \langle m1(y), s1(b) \rangle\} \rangle$
$\langle s2(b), \{\langle m1(x), s1(a) \rangle, \langle m1(y), s1(b) \rangle\} \rangle$
$\langle s2(c), \{\langle m1(x), s1(a) \rangle, \langle m1(y), s1(b) \rangle\} \rangle$

It may seem that modes a, b, c of s2 can be merged; however, s2 also influences another manifestation, m2:

$\langle s2(a), \{\langle m2(x), s1(a) \rangle, \langle m2(y), s1(b) \rangle, \langle m2(z), \bot \rangle\} \rangle$
$\langle s2(b), \{\langle m2(x), s1(a) \rangle, \langle m2(y), s1(b) \rangle, \langle m2(z), \bot \rangle\} \rangle$
$\langle s2(c), \{\langle m2(x), \bot \rangle, \langle m2(y), \bot \rangle, \langle m2(z), s1(a) \vee s1(b) \rangle\} \rangle$

We can thus merge modes a, b in new mode ab, but not mode c.


Having considered all the states, we now turn to the components, starting from c1:

$\langle c1(a), \{\langle s1(a), (i1(a) \wedge c2(a)) \vee (i1(a) \wedge c2(b)) \vee (i1(b) \wedge c2(c)) \rangle, \langle s1(b), (i1(a) \wedge c2(c)) \vee (i1(b) \wedge c2(a)) \vee (i1(b) \wedge c2(b)) \rangle\} \rangle$
$\langle c1(b), \{\langle s1(a), (i1(a) \wedge c2(a)) \vee (i1(a) \wedge c2(b)) \vee (i1(b) \wedge c2(c)) \rangle, \langle s1(b), (i1(a) \wedge c2(c)) \vee (i1(b) \wedge c2(a)) \vee (i1(b) \wedge c2(b)) \rangle\} \rangle$

Modes a, b of c1 can then be merged in new mode ab; note that c1 goes into the trivial-nodes list, since its whole domain has collapsed into a singleton. Similar arguments lead to merging modes a, b of c2; however, mode c of c2 can't be merged with the other two modes.

Component c3 is the only one left to be considered:

$\langle c3(a), \{\langle s2(ab), \bot \rangle, \langle s2(c), i2(a) \vee i2(b) \rangle\} \rangle$
$\langle c3(b), \{\langle s2(ab), i2(a) \vee i2(b) \rangle, \langle s2(c), \bot \rangle\} \rangle$
$\langle c3(c), \{\langle s2(ab), i2(a) \vee i2(b) \rangle, \langle s2(c), \bot \rangle\} \rangle$

We can merge modes b, c into a new mode bc. Note that we can merge these modes only because we already unified modes a and b of s2; the importance of processing variables in the $\succ$ relation order is now evident.

Note also that we could have considered for abstraction s2 before s1, or c2 before c1 or after c3, since s1, s2 and c1, c2, c3 are not tied by the precedence order relation. It is easy to see that in such a case the same mergings would have taken place anyway.

The output of the process described above results in a revised domain theory:

$s1(a) \wedge s2(ab) \Rightarrow m1(x)$
$s1(a) \wedge s2(c) \Rightarrow m1(x)$
$s1(b) \wedge s2(ab) \Rightarrow m1(y)$
$s1(b) \wedge s2(c) \Rightarrow m1(y)$

$s1(a) \wedge s2(ab) \Rightarrow m2(x)$
$s1(a) \wedge s2(c) \Rightarrow m2(z)$
$s1(b) \wedge s2(ab) \Rightarrow m2(y)$
$s1(b) \wedge s2(c) \Rightarrow m2(z)$

$i1(a) \wedge c1(ab) \wedge c2(ab) \Rightarrow s1(a)$
$i1(a) \wedge c1(ab) \wedge c2(c) \Rightarrow s1(b)$
$i1(b) \wedge c1(ab) \wedge c2(ab) \Rightarrow s1(b)$
$i1(b) \wedge c1(ab) \wedge c2(c) \Rightarrow s1(a)$

$i2(a) \wedge c3(a) \Rightarrow s2(c)$
$i2(a) \wedge c3(bc) \Rightarrow s2(ab)$
$i2(b) \wedge c3(a) \Rightarrow s2(c)$
$i2(b) \wedge c3(bc) \Rightarrow s2(ab)$

Figure 3. System Structure Graph for the Example Domain Theory


and abstracted domains:

$DOM(m1) = \{x, y\}$, $DOM(m2) = \{x, y, z\}$
$DOM(s1) = \{a, b\}$, $DOM(s2) = \{ab, c\}$
$DOM(c1) = \{ab\}$, $DOM(c2) = \{ab, c\}$, $DOM(c3) = \{a, bc\}$
$DOM(i1) = \{a, b\}$, $DOM(i2) = \{a, b\}$
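The whole example can be replayed with a compact sketch. It is an illustrative reconstruction under several assumptions, not the authors' implementation: clauses are pairs (frozenset of body atoms, head atom), conditions are compared as sets of conjunct sets, modes are grouped by their joint effect on all immediate successors at once (which yields the same partition as the recursive refinement of figure 2), and merged modes are named by concatenation. The processing order is fixed by hand.

```python
from itertools import product


def clause(head, *body):
    return (frozenset(body), head)


# Detailed domain theory of the example.
DT = set()
for s1, out in (("a", "x"), ("b", "y")):
    for s2 in "abc":
        DT.add(clause(("m1", out), ("s1", s1), ("s2", s2)))
        DT.add(clause(("m2", "z" if s2 == "c" else out), ("s1", s1), ("s2", s2)))
for i1, c1, c2 in product("ab", "ab", "abc"):
    s1 = i1 if c2 != "c" else ("b" if i1 == "a" else "a")  # c2(c) inverts the input
    DT.add(clause(("s1", s1), ("i1", i1), ("c1", c1), ("c2", c2)))
for i2, c3 in product("ab", "abc"):
    DT.add(clause(("s2", {"a": "c", "b": "a", "c": "b"}[c3]), ("i2", i2), ("c3", c3)))

CHILDREN = {"s1": ["m1", "m2"], "s2": ["m1", "m2"], "c1": ["s1"], "c2": ["s1"], "c3": ["s2"]}
ORDER = ["s1", "s2", "c1", "c2", "c3"]  # successors before predecessors


def domain(dt, var):
    return {val for body, head in dt for v, val in (*body, head) if v == var}


def profile(dt, node, mode, successors):
    """Joint influence of node(mode) on every value of every immediate successor."""
    atom = (node, mode)
    heads = {h for _, h in dt if h[0] in successors}
    return frozenset(
        (h, frozenset(body - {atom} for body, head in dt if head == h and atom in body))
        for h in heads
    )


def abstract(dt, order, children):
    for node in order:
        classes = {}
        for mode in sorted(domain(dt, node)):
            classes.setdefault(profile(dt, node, mode, children[node]), []).append(mode)
        rename = {m: "".join(cls) for cls in classes.values() for m in cls}

        def substitute(atom):
            return (atom[0], rename[atom[1]]) if atom[0] == node else atom

        dt = {(frozenset(map(substitute, body)), substitute(head)) for body, head in dt}
    return dt


abstract_dt = abstract(DT, ORDER, CHILDREN)
print(len(DT), "->", len(abstract_dt))  # 30 -> 16
for var in ORDER:
    print(var, sorted(domain(abstract_dt, var)))
```

Under these assumptions the sketch reproduces the sixteen clauses and the abstracted domains listed above.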


4  Using  Abstract  Models  in  On-Board 
Diagnosis 

Having  described  how  the  abstraction  algorithm  works,  we 
now  consider  how  it  can  be  used  in  real  scenarios  to  practically 
improve  the  performance  of  the  diagnostic  problem  solver. 

A first scenario is when the manifestations of the system can be naturally subdivided into classes $CL_k$ (see section 2); one such class will contain all the manifestations ($CL_{all}$), another may contain only sensorized manifestations ($CL_{sens}$), further ones may exclude from $CL_{sens}$ other groups of manifestations that can potentially all become unavailable together in some contexts. Similarly, manifestation granularities (expressed as abstraction functions $\tau_i$) may be identified and associated to the classes they apply to.

Equipped with this set of pairs $(CL_k, \tau_i)$, we can generate off-line a corresponding set of models $M_{ki}$; when a specific diagnostic problem is presented to the on-line diagnostic agent, the minimal $(CL_k, \tau_i)$ that covers the available observations is selected, and the corresponding model $M_{ki}$ is used to compute diagnoses.
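The on-line selection step can be sketched as a lookup over the precompiled models. The sketch below is a simplification under stated assumptions: granularities are ignored, "minimal" is read as "smallest covering class by cardinality", and the model names and the `select_model` function are hypothetical.

```python
def select_model(models, available):
    """Pick the precompiled abstract model whose manifestation class is the
    smallest one that still covers the currently available observations.
    `models` maps a frozenset of manifestations to a model identifier."""
    covering = [cl for cl in models if available <= cl]
    if not covering:
        return None  # no precompiled class fits; a model must be built on demand
    return models[min(covering, key=len)]


models = {
    frozenset({"m1", "m2", "m3"}): "M_all",
    frozenset({"m1", "m2"}): "M_sens",
}
print(select_model(models, {"m1"}))        # M_sens
print(select_model(models, {"m1", "m3"}))  # M_all
print(select_model(models, {"m9"}))        # None
```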

Sometimes, however, classes of manifestations (and their granularity) cannot be conveniently identified a priori. In such cases we may want to compute an abstract model on demand, given the particular $CL_k$ and $\tau_i$ that have been identified as currently available (see footnote 6).

The system may perform this on-line model synthesis as a lower priority task, asynchronously with the diagnostic tasks; once the ad hoc $M_{ki}$ has been computed it may be reused for many diagnostic problems until some condition on the available observations or their granularity changes.

6 How this information can be gathered, either manually or automatically, is out of the scope of the present paper

Obviously, time overhead is added by the computation of models, but in some situations this may well be paid off by the benefits (see below). Moreover, experimental data presented below in section 5 show that such overhead may be in the order of the time needed for solving a few easy diagnostic cases (involving a single fault) or just a difficult one (involving multiple faults); keeping in mind that the abstraction algorithm is only run once whilst many diagnostic problems can exploit such an abstract model, this overhead may be acceptable.

In  both  the  above  scenarios,  the  number  of  returned 
diagnoses  is  reduced  by  returning  diagnoses  for  the  abstract 
model  that  correspond  to  sets  of  diagnoses  for  the  detailed 
model  that  carry  essentially  the  same  information,  as  proved 
in  section  3.2. 

Moreover,  whenever  a  diagnosis  for  the  abstract  model 
mentions  a  “compound  mode”  (i.e.  a  new  mode  introduced  in 
place  of  a  non-singleton  set  of  indistinguishable  modes),  we 
explicitly  know  that  the  set  of  modes  it  represents  couldn’t 
be  discriminated  even  in  different  contexts.  Thus,  in  case 
further  tests  involving  different  contexts  are  planned,  they 
shouldn’t  aim  at  that  kind  of  discrimination. 

Both the reduced size and the increased informativity of the result should be helpful for the human or automatic supervisor which must interpret it and take action accordingly.


5  Experimental  Results 

We have implemented the algorithm described above as a module of the diagnostic agent for the space robotic arm SPIDER developed by ASI (Agenzia Spaziale Italiana); for a description of the diagnostic agent please see [12] and [11].

The model of the robotic arm (which obeys definition 2.1) is complex enough to represent an interesting test-bed: it consists of 35 assumables (COMPS) with an average of 3.43 behavioral modes each, 45 manifestations (OBS) and 1143 formulas (see footnote 7).

Observations  in  such  a  model  are  explicitly  partitioned 
into  two  classes:  sensorized  (CLsens)  and  non-sensorized 
(CLnosens).  While  observations  in  CLsens  can  reasonably  be 
assumed  to  always  be  available,  observations  in  CLnosens  are 
available  through  manual  measurements  that  can  be  carried 
only  during  off-line  diagnosis. 

We have applied the abstraction algorithm to the model by passing $CL_{sens}$ as the available observations (assuming $\tau_{obs}$ = identity, i.e. manifestations available at their maximum granularity) and obtained a simplified model as output. Table 1 shows some relevant static measures on the detailed and abstract models: the number of clauses has been reduced by 18.6%, and the average number of behavioral modes per system component has been reduced by 16.9%. Compilation of the detailed model into the abstract one took 1494 msec of CPU time (all results in this section refer to a Java implementation of both the diagnostic agent and the abstraction algorithm, compiled and run using JDK 1.3 on a Sun Sparc Ultra 5 equipped with SunOS 5.8).

7 The number of formulas is greatly reduced by the use of a noisy-max modeling technique, see [12]


model      clauses   modes avg
detailed   1143      3.43
abstract   930       2.85

Table 1. Comparison between abstract and detailed models


We have then compared the performance of the diagnostic agent when it uses the detailed (original) versus the generated abstract model. Using the simulator for the diagnostic agent, three test sets of 100 diagnostic problems each have been automatically generated; problems in test sets 1, 2 and 3 had 1, 2 and 3 faults injected respectively. Table 2 reports on the reduction of the average number of diagnoses returned. The reductions obtained in test set 2 (-43%) and test set 3 (-61.6%) appear particularly significant.
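The quoted percentages follow from the averages reported in Table 2 as relative reductions. The small check below recomputes them; the dictionary and function names are illustrative only, and the confidence intervals are not taken into account.

```python
def reduction(detailed, abstract):
    """Relative reduction, in percent, of the abstract value w.r.t. the detailed one."""
    return 100.0 * (detailed - abstract) / detailed


# Average number of diagnoses (detailed, abstract) from Table 2.
table2 = {
    "test set 1": (5.0, 3.7),
    "test set 2": (17.9, 10.2),
    "test set 3": (123.3, 47.3),
}
for name, (d, a) in table2.items():
    print(f"{name}: -{reduction(d, a):.1f}%")
# test set 1: -26.0%
# test set 2: -43.0%
# test set 3: -61.6%
```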

It should be noted that the diagnostic agent returns only preferred diagnoses (in particular, those that have a minimal number of faults), thus the reported reductions are obtained by compacting "good-quality" diagnoses, not by discarding implausible ones.


model      test set 1    test set 2    test set 3
detailed   5.0 ± 0.6     17.9 ± 3.6    123.3 ± 23.1
abstract   3.7 ± 0.4     10.2 ± 1.9    47.3 ± 8.4

Table 2. Average number of elementary diagnoses obtained
with abstract and detailed models (95% confidence)
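
The relative reductions quoted in the text can be recomputed
from the entries of Tables 1 and 2. The following is a minimal
sketch; the function name reduction_percent and the dictionary
layout are our own illustrative choices, not part of the
original implementation.

```python
def reduction_percent(detailed, abstract):
    """Relative reduction of `abstract` with respect to `detailed`, in percent."""
    return 100.0 * (detailed - abstract) / detailed


# (detailed, abstract) pairs copied from Table 1 and Table 2.
measures = {
    "clauses": (1143, 930),
    "modes avg": (3.43, 2.85),
    "test set 1": (5.0, 3.7),
    "test set 2": (17.9, 10.2),
    "test set 3": (123.3, 47.3),
}

reductions = {
    name: round(reduction_percent(detailed, abstract), 1)
    for name, (detailed, abstract) in measures.items()
}

for name, value in reductions.items():
    print(f"{name}: -{value}%")
```

The values for the clauses, the modes and test sets 2 and 3
agree with the figures reported in the text.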

Even  if  our  diagnostic  agent  uses  a  compact  encoding  for 
candidate  diagnoses  during  the  search  process,  thus  obtaining 
an  optimized  search  space  size  that  is  not  proportional  to  the 
number  of  diagnoses  ([11]),  the  average  time  employed  for 
solving  problems  using  the  abstract  model  appears  to  be  at 
least  no  worse  than  that  obtained  by  using  the  detailed  model 
(see Table 3).


model      test set 1    test set 2    test set 3
detailed   241 ± 19      337 ± 45      1212 ± 182
abstract   235 ± 25      259 ± 32      988 ± 153

Table 3. Average CPU times obtained with abstract and
detailed models (in msec; 95% confidence)

Consistent  results  (both  in  terms  of  static  reduction  of  the 
model  size  and  reduction  of  diagnoses)  have  been  obtained 
by  applying  the  abstraction  algorithm  to  other  subclasses  of 
manifestations  in  the  model.  Space  precludes  reporting  them 
in  this  paper. 

Please note that the faulty behavioral modes modeled for
the components were the ones listed in the FMECA document
for the real device, which indicates that the results obtained
with the abstraction algorithm and reported above are of
interest for a real-world system.

6  Related  Work  and  Conclusions 

Literature on MBD contains several proposals to use
abstraction as a means of simplifying the system model and,
consequently, the characterization and computation of diagnoses.
Some of them formulate abstraction rules and prove that
abstractions obtained by their application preserve important
properties, i.e. by reasoning at the abstract level we do not
overlook any diagnoses ([9], [13]).

Among the rules proposed by Mozetic, rule 1
(Refinement/collapse of values) aims exactly at abstracting sets of
values at the detailed level into a single value at a more
abstract level. Compared to our approach, however, both
Mozetic and Provan assume that the abstraction is done
manually, by a human knowledgeable about the system behavior
and structure. Moreover, they use the abstract models only in
order to reduce the computational complexity, but still return
detailed-level diagnoses.

In a recent paper ([2]), the authors improve Mozetic's approach
so that the hierarchy (still manually provided) is
automatically rearranged on a case-by-case basis in order not to hide
any available observations from the abstract levels.

Sachenbacher and Struss ([14]) have defined a relational
approach (i.e. the behavior model is given as a relation
R among tuples of variables) for automated abstraction of
variable domains and showed its usefulness in building
system models by composing sub-system models and then
abstracting away value details that are not of interest in the
resulting model. In contrast to our approach, their work
assumes that a desired abstraction τtarg is given as part of the
abstraction problem, together with the restrictions on the
available observations τobs that also appear in our
approach.

Travé-Massuyès, Escobet and Milne ([15]) define a notion of
indiscriminability among faulted components which is
somewhat similar to our notion of indistinguishability among
behavioral modes. Their work, which is based on a relational
model of the system, aims at suggesting the addition of
sensors in order to make all the possible faults discriminable.

In this paper we have shown how abstraction of variable
domains in propositional, qualitative system models can
significantly reduce the average number of admissible diagnoses
when only a subset of observables is available.

This is particularly useful in on-line diagnosis, where
limitations on the number and/or granularity of observables are
likely to apply.

The algorithm presented in the paper has a larger
applicability than discussed so far. For example, it can be used as a
support tool for diagnosability during system modeling (i.e.
the algorithm can point out discrepancies between the
granularity of the model being defined and that of the system
observables).

There are many directions we are considering for extending
our work. The current version of the algorithm assumes that
in the device structure nodes are connected by at most one
directed path, while the representation of some systems of
practical interest does not obey this restriction.

We could also explore how our automatic abstraction
techniques can be extended in order to merge together
components whose contributions to the available observations are
indiscriminable (i.e. introducing the notion of
indistinguishable components).

Abstract. When designing model-based fault-diagnosis systems,
the use of consistency relations (also called e.g. parity relations) is
a common choice. Different subsets of consistency relations are
sensitive to different subsets of faults, and thereby isolation can be
achieved. This paper presents an algorithm for finding a small set of
submodels that can be used to derive consistency relations with the
highest possible diagnosis capability. The algorithm handles
differential-algebraic models and is based on graph-theoretical
reasoning about the structure of the model. An important step towards
finding these submodels, and therefore also towards finding
consistency relations, is to find all minimal structurally singular
(MSS) sets of equations. These sets characterize the fault
diagnosability. The algorithm is applied to a large nonlinear
industrial example, a part of a paper plant. In spite of the complexity
of this process, a small set of consistency relations with high
diagnosis capability is successfully derived.

1  Introduction 

When designing model-based fault-diagnosis systems using the
principle of consistency-based diagnosis [5, 11, 6], a crucial step is
conflict recognition. As shown in [3], conflict recognition can be
achieved by using pre-computed consistency relations (also called
e.g. analytical redundancy relations or parity relations). With
properly chosen consistency relations, different subsets of
consistency relations are sensitive to different subsets of faults. In
this way isolation between different faults can be achieved.

The systems considered in this paper are assumed to be modeled
by a set of nonlinear and linear differential-algebraic equations.
Finding consistency relations by directly manipulating these equations
is a computationally complex task, especially for large and nonlinear
systems. To reduce the computational complexity of deriving
consistency relations, this paper proposes a two-step approach. In the
first step, the system is analyzed structurally to find overdetermined
submodels. In the second step, each of these submodels is transformed
into consistency relations. The benefit of this two-step approach is
that the submodels obtained are typically much smaller than the whole
model, and therefore the computational complexity of deriving
consistency relations from each submodel is substantially lower
compared to directly manipulating the whole model.

The main contribution and the focus of the paper is a structural
algorithm for finding these submodels. Instead of directly
manipulating the equations themselves, the proposed algorithm only
deals with the structural information contained in the model, i.e.
which variables appear in each equation. This structural information
is collected in a structural model. In addition to finding all
submodels that can be used to derive consistency relations, the
algorithm also selects a small set of submodels that corresponds to
consistency relations with the highest possible diagnosis capability.

In industry, design of diagnosis systems can be very time
consuming if done manually. Therefore it is important that methods for
diagnosis-system design are as systematic and automatic as possible.
The algorithm presented here is fully automatic and only needs as
input a structural model of the system. This structural model can in
turn easily be derived from, for example, simulation models.

Structural approaches have also been studied in other works
dealing with fault diagnosis. In [10] a structural approach is
investigated as an alternative to dependency-recording engines in
consistency-based diagnosis. Furthermore, a structural approach is
used in the study of supervision ability in [2], and an extension to
this work considering sensor placement is found in [12].

In Sections 2 and 3, structural models and their usefulness in fault
diagnosis are discussed. Then in Section 4, a complete description of
the algorithm is given. The algorithm is then in Section 5 applied to
a large nonlinear industrial process, a part of a paper plant. In spite
of the complexity of this process, a small set of consistency relations
with high diagnosis capability is successfully derived.

2  Structural  models 

The behavior of a system is described with a model. Usually the
model is a set of equations. A structural model [2] contains only the
information about which variables are contained in each equation.
Let Morig denote the structural model obtained from the equations
describing the system to be diagnosed. This structural model will
contain three different kinds of variables: known variables Y, e.g.
sensor signals and actuators; unknown variables Xu, for example
internal states of the system; and finally the faults F. If faults are
decoupled then they will also be included in Xu. The differentiated
and non-differentiated versions of the same variable are considered to
be different variables. The time-shifted variables in the
time-discrete case are also considered to be separate variables.

A structural model can be represented by an incidence matrix [4,
1]. The rows correspond to equations and the columns to variables. A
cross in position (i, j) indicates that variable j is included in
equation i.

Example 1 A simple example is a pump, pumping water into the top
of a tank. The water flows out of the tank through a pipe connected
to the bottom of the tank. The known variables are the pump input u,
the measured water level in the tank yh, and the measured flow from
the tank yf. One fault, denoted f with the variable name as index
(fu, fyh and fyf), is assumed to be associated with each known
variable. The actual flows to and from the tank are denoted F1 and
F2 respectively, and the actual water level in the tank is denoted h.
Without knowing the exact physical equations describing the analytic
model, the structural model can be set up as follows:

Equation e1 describes the pump, e2 the conservation of volume in
the tank, e3 the water level measurement, e4 the flow from the tank
caused by gravity, e5 the flow measurement, and e6 a fault model
for the flow measurement fault fyf.
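
To illustrate how such an incidence matrix can be built, the
following sketch encodes one possible structure for the tank
example. The variable sets per equation are an assumption made
here for illustration only, chosen to be compatible with the
verbal description of e1 to e6; the names structure,
incidence_matrix and h_dot (the derivative of h, treated as a
separate variable) are hypothetical.

```python
# Assumed structure: which variables each equation of the tank example involves.
structure = {
    "e1": {"u", "F1", "fu"},       # pump
    "e2": {"F1", "F2", "h_dot"},   # conservation of volume in the tank
    "e3": {"yh", "h", "fyh"},      # water level measurement
    "e4": {"F2", "h"},             # outflow caused by gravity
    "e5": {"yf", "F2", "fyf"},     # flow measurement
    "e6": {"fyf"},                 # fault model for fyf
}


def incidence_matrix(structure):
    """Return the sorted column variables and, per equation, a row of 'X'/'.' marks."""
    columns = sorted(set().union(*structure.values()))
    rows = {
        eq: ["X" if var in used else "." for var in columns]
        for eq, used in structure.items()
    }
    return columns, rows


columns, rows = incidence_matrix(structure)
print("    " + " ".join(f"{c:>5}" for c in columns))
for eq, marks in rows.items():
    print(f"{eq:<4}" + " ".join(f"{m:>5}" for m in marks))
```

Only the presence or absence of a variable in an equation is
recorded; no analytical form of the equations is needed.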

3  Fault  Diagnosis  Using  Structural  Models 

The task is to find submodels that can be used to form consistency
relations. To be able to draw a correct conclusion about the
diagnosability from the structural analysis, it is crucial that for
each of these submodels there is a consistency relation that validates
all equations included in the submodel. The common definition of
consistency relation does not ensure this. Therefore a new definition
of consistency relation for an equation set is introduced that
explicitly points out the submodel considered. Before a consistency
relation for E is defined, some notation is needed.

Let x and y denote the vectors of variables contained in Xu and
Y respectively. Then E(x, y) denotes an equation set that depends on
variables contained in Xu and Y.

Definition 1 (Consistency Relation for E) A scalar equation
c(y) = 0 is a consistency relation for the equations E(x, y) iff

$$\exists x \, E(x, y) \Leftrightarrow c(y) = 0 \qquad (2)$$

and there is no proper subset of E that has property (2).

Definition 1 differs from the common definition of consistency
relation in two ways: the left implication in (2), and the requirement
that there is no proper subset of E that has property (2). We refer to
the latter as the minimality condition in Definition 1. The following
example shows the importance of the left implication in (2).

Example 2 Consider the model E = {y1 = x, y2 = x, y3 = x}.
The equation y1 - y2 = 0 is not a consistency relation for E, because
it is true even if e.g. y3 ≠ y1 = y2, and then it is impossible to find
a consistent x in E. However, y1 - y2 = 0 is a consistency relation
for {y1 = x, y2 = x}.

The expression y1 + y2 - 2y3 = 0 includes y3. The right
implication in (2) holds, but the opposite direction does not hold. The
conclusion is that this expression is also not a consistency relation
for E or any equation subset of E.

However, (y1 - y2)^2 + (y2 - y3)^2 = 0 is a consistency relation
for E.
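
The equivalence (2) for the three candidate expressions of
Example 2 can be sanity-checked numerically. The sketch below
tests the equivalence on a small integer grid only, so it can
refute a candidate but cannot prove one, and it does not test
the minimality condition. The helper names consistent and
equivalent_on_grid are our own.

```python
from itertools import product


def consistent(y1, y2, y3):
    """E = {y1 = x, y2 = x, y3 = x} has a solution x iff all observations agree."""
    return y1 == y2 == y3


candidates = {
    "y1 - y2": lambda y1, y2, y3: y1 - y2,
    "y1 + y2 - 2*y3": lambda y1, y2, y3: y1 + y2 - 2 * y3,
    "(y1 - y2)^2 + (y2 - y3)^2": lambda y1, y2, y3: (y1 - y2) ** 2 + (y2 - y3) ** 2,
}


def equivalent_on_grid(c, values=range(3)):
    """Check 'exists x: E(x, y)  <=>  c(y) = 0' for every y on the grid."""
    return all(consistent(*y) == (c(*y) == 0) for y in product(values, repeat=3))


results = {name: equivalent_on_grid(c) for name, c in candidates.items()}
for name, ok in results.items():
    print(f"{name} = 0: equivalence holds on grid: {ok}")
```

For instance, y = (0, 0, 1) refutes the first candidate and
y = (0, 2, 1) refutes the second, while the third candidate
vanishes exactly when y1 = y2 = y3.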

The  minimality  condition  in  Definition  1  is  important,  because  it 
guarantees  that  any  invalid  equation  can  infer  an  inconsistency. 


3.1  Basic  Assumptions 

Basic assumptions are needed to guarantee that the subsets found
only by analyzing structural properties are exactly those subsets that
can be used to form consistency relations. Before the basic
assumptions are presented, some notation is needed. Let E be any set of
equations and X any set of variables. Then define

$$\mathrm{var}_X(E) = \{x \in X \mid \exists e \in E : e \text{ contains } x\}$$

and

$$\mathrm{equ}_E(X) = \{e \in E \mid \exists x \in X : e \text{ contains } x\}.$$

Also, let $\mathrm{var}_X(e)$ and $\mathrm{equ}_E(x)$ be shorthand
notations for $\mathrm{var}_X(\{e\})$ and $\mathrm{equ}_E(\{x\})$
respectively. If g is any equation, function or variable, let
$g^{(i)}$ denote the i:th time derivative of g. Then define

$$\overline{\mathrm{var}}_X(E) = \{\text{undifferentiated } x \mid \exists i \, (x^{(i)} \in \mathrm{var}_X(E))\},$$

e.g. $\overline{\mathrm{var}}_{X_u \cup Y}(\{y = x\}) = \{y, x\}$.
Finally, the number of elements in any set E is denoted |E|.

The first assumption is introduced to ensure that the model
becomes finitely differentiated in Section 4.1.

Assumption 1 The model Morig has the property

$$\forall E \subseteq M_{orig} : |E| \le |\overline{\mathrm{var}}_{X_u \cup Y}(E)|. \qquad (3)$$

The meaning of condition (3) is that each subset of equations includes
at least as many different variables as equations, considering
derivatives as the same variable. If condition (3) is not fulfilled and
there are no redundant equations, the model would normally be
inconsistent.

As mentioned earlier, the structural model contains less
information than the analytical model. The next assumption makes it
possible to draw conclusions about analytical properties from the
structural properties.

Assumption 2 There exists a consistency relation c(y) = 0 for the
equation set H iff

$$\forall X' \subseteq \mathrm{var}_{X_u}(H),\ X' \neq \emptyset : |X'| < |\mathrm{equ}_H(X')| \qquad (4)$$

According to Assumption 2, the unknown variables in H can be
eliminated if and only if, for each nonempty subset of the unknown
variables in H, the number of variables is less than the number of
equations in H which contain some of the variables in the chosen
subset.
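
Condition (4) can be checked directly on small equation sets
by enumerating all nonempty subsets of unknown variables. The
following is a brute-force sketch under the assumption that an
equation set is given as a mapping from (hypothetical) equation
names to the sets of unknown variables they contain; it is
exponential in the number of variables and meant only as an
illustration.

```python
from itertools import combinations


def satisfies_condition_4(H):
    """H maps equation name -> set of unknown variables contained in that equation."""
    unknowns = sorted(set().union(*H.values()))
    for size in range(1, len(unknowns) + 1):
        for subset in combinations(unknowns, size):
            chosen = set(subset)
            # equ_H(X'): equations of H containing at least one chosen variable.
            touching = [e for e, used in H.items() if used & chosen]
            if not len(chosen) < len(touching):
                return False
    return True


# Two equations sharing one unknown: x can be eliminated.
overdetermined = {"ea": {"x"}, "eb": {"x"}}
# Two equations, two unknowns: x2 occurs in a single equation only.
just_determined = {"ea": {"x1"}, "eb": {"x1", "x2"}}

print(satisfies_condition_4(overdetermined))
print(satisfies_condition_4(just_determined))
```

In the second model the subset {x2} is contained in only one
equation, so the strict inequality in (4) fails.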

Assumptions 1 and 2 are often fulfilled. For example, all subsets
of equations found in the industrial example at the end of the
paper satisfy Assumption 2. Even though the "only if" direction of
Assumption 2 is difficult to validate in an application, the results of
the paper can still be used to produce a lower bound of the actual
detection and isolation capability.

If all subsets of the model fulfill Assumption 2, the structural
analysis will find all subsets that can be used to find consistency
relations.

3.2  Finding  Consistency  Relations  via  MSS  Sets 

Now, the task of finding those submodels that can be used to derive
consistency relations will be transformed to the task of finding the
subsets of equations that have the structural property (4). To do this,
two important structural properties are defined [9].

Definition 2 (Structurally Singular) A finite set of equations E is
structurally singular with respect to the set of variables X if
$|E| > |\mathrm{var}_X(E)|$.

Definition 3 (Minimal Structurally Singular) A structurally
singular set is a minimal structurally singular (MSS) set if none of its
proper subsets are structurally singular.
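
Definitions 2 and 3 translate directly into two predicates. The
sketch below is a literal transcription for illustration; the
example model and the names is_structurally_singular and is_mss
are assumptions of ours, and the subset enumeration is only
practical for small sets.

```python
from itertools import combinations


def is_structurally_singular(model, eqs):
    """Definition 2: more equations than (unknown) variables contained in them."""
    variables = set().union(*(model[e] for e in eqs))
    return len(eqs) > len(variables)


def is_mss(model, eqs):
    """Definition 3: structurally singular and no proper subset is."""
    eqs = list(eqs)
    if not is_structurally_singular(model, eqs):
        return False
    return not any(
        is_structurally_singular(model, subset)
        for size in range(1, len(eqs))
        for subset in combinations(eqs, size)
    )


# Hypothetical model: equation name -> unknown variables it contains.
model = {"e1": {"x1"}, "e2": {"x1", "x2"}, "e3": {"x2"}, "e4": {"x2"}}

print(is_mss(model, ["e3", "e4"]))        # two equations, one unknown
print(is_mss(model, ["e1", "e2", "e3"]))  # three equations, two unknowns
print(is_mss(model, ["e1", "e3", "e4"]))  # singular, but contains {e3, e4}
```

The last set is structurally singular but not minimal, since
its proper subset {e3, e4} is already structurally singular.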


For simplicity, MSS will always mean MSS with respect to Xu in
the rest of the text. The next theorem states that it is necessary and
sufficient to find all MSS sets to get all different sets that can be
utilized to form consistency relations. The task of finding all
submodels that can be used to derive consistency relations has thereby
been transformed to the task of finding all MSS sets.

Theorem 1 Let $H \subseteq M_{orig}$, where Morig fulfills
Assumption 1. Further, let H and all Ei fulfill Assumption 2. Then there
exists a consistency relation c(y) = 0 for H(x, y), where
$|H| < \infty$, iff $H = \bigcup_i E_i$ where, for each i, Ei is an
MSS set.

For a proof, see [7].

4  Algorithm  for  finding  and  selecting  MSS  sets 

The  objective  is  to  find  all  MSS  sets  in  a  differentiated  version  of  the 
model  Morig  and  then  choose  a  small  subset  of  these  MSS  sets  with 
the  same  diagnosability  as  the  full  set  of  MSS  sets.  The  algorithm 
can  be  summarized  in  the  following  steps. 

Algorithm  1 

1. Differentiating the model: Find equations that are meaningful to
differentiate for finding MSS sets.

2. Simplifying the model: Given the original model and the
additional equations found in step (1), remove all equations that
cannot be included in any MSS set. To simplify the next step, merge
sets of equations that have to be used together in each MSS set.

3. Finding MSS sets: Search for MSS sets in the simplified model.

4. Analyzing Diagnosability: Examine the diagnosability of the MSS
sets found in step (3).

5. Decoupling faults: If the diagnosability has to be improved, some
faults have to be decoupled. For decoupling faults, return to
step (1) and consider these faults as unknown variables in Xu.

6. Selecting a subset of MSS sets: Select the simplest set of MSS sets
that contains the desired diagnosability.

Note that, to avoid searching for all MSS sets with all possible
faults decoupled, Algorithm 1 has been organized so that the fault-free
model is analyzed first. Then, if it is necessary for achieving higher
isolability, faults are decoupled. The following sections discuss each
of the steps in Algorithm 1.

4.1  Differentiating  the  Model 

To handle dynamic models, Algorithm 1 needs a way to deal with
derivatives. In this section an algorithm for handling derivatives is
defined. This algorithm is referred to as Algorithm 2. A small
example will show what Algorithm 2 must be capable of handling.

Example 3 Consider the model
$E = \{e_1, e_2, e_3\} = \{y_1 = x,\ y_2 = \dot{x},\ y_3 = x^2\}$.
It is obviously impossible to eliminate $\dot{x}$ in e2 if
differentiation of any equation is forbidden. In general, all
derivatives of E have to be considered. If $E^{(i)}$ denotes the set of
the i:th time derivatives of each element, the equation set generally
considered is $\bigcup_{i=0}^{\infty} E^{(i)}$.

Even though $\mathrm{var}_{X_u}(e_1) = \mathrm{var}_{X_u}(e_3) = \{x\}$,
the derivatives of e1 and e3 contain different sets of variables,
because $\mathrm{var}_{X_u}(\dot{e}_1) = \{\dot{x}\} \neq
\mathrm{var}_{X_u}(\dot{e}_3) = \{x, \dot{x}\}$. Since x is linearly
contained in e1, the variable x disappears in $\dot{e}_1$. Knowledge
about which of the variables are contained linearly in an equation
determines the set of variables in the differentiated equation
completely.
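
The structural effect of differentiating an equation, as
described in Example 3, can be sketched as follows. Variables
are represented as (name, order) pairs, and the set of linearly
contained variables is supplied by the caller. This
representation and the function name differentiate are
assumptions made for illustration; "linearly contained" is read
here as "appearing linearly with a constant coefficient".

```python
def differentiate(variables, linear):
    """Structure of the time derivative of an equation.

    variables: set of (name, derivative_order) pairs in the equation.
    linear:    subset of `variables` that is linearly contained.
    Each variable contributes its next derivative; it stays in the
    differentiated equation itself only if it is not linearly contained.
    """
    result = set()
    for name, order in variables:
        result.add((name, order + 1))
        if (name, order) not in linear:
            result.add((name, order))
    return result


# e1: y1 = x   (both variables linearly contained)
e1 = {("y1", 0), ("x", 0)}
d_e1 = differentiate(e1, linear=e1)

# e3: y3 = x^2 (only y3 linearly contained)
e3 = {("y3", 0), ("x", 0)}
d_e3 = differentiate(e3, linear={("y3", 0)})

unknown_in_d_e1 = {v for v in d_e1 if v[0] == "x"}
unknown_in_d_e3 = {v for v in d_e3 if v[0] == "x"}
print(sorted(unknown_in_d_e1))
print(sorted(unknown_in_d_e3))
```

Restricted to the unknown x, the differentiated e1 contains
only the first derivative of x, whereas the differentiated e3
contains both x and its first derivative, as stated in the
example.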


For all natural numbers j, $y_1^{(j+1)} - y_2^{(j)} = 0$ is a
consistency relation. Most of these consistency relations contain high
orders of derivatives of y1 and y2. The derivatives of known variables
are in general not known, but they can usually be estimated. The higher
the order of a derivative, the more difficult it is to estimate.
Thus it is reasonable to set a limit m(y) for each variable y on the
order of derivative that can be considered possible to estimate.
Derivatives up to m(y) are then considered to be known, and higher
derivatives belong to Xu.

To summarize the example, Algorithm 2 must be capable of
differentiating equations. To produce a correct structural
representation of differentiated equations, the algorithm must take
linearly contained variables into account. Further, it has to handle
the limitation m(y) for each $y \in Y$.

Algorithm 2 consists of two parts. The first part is a modification of
Pantelides’ algorithm [9]. Let
$M = \bigcup_{i=1}^{n} \bigcup_{j=0}^{\alpha_i} \{e_i^{(j)}\}$; then
$\alpha_i$ is the highest number of differentiations in M of
equation i. Then M is a differentiated model of
$M_{orig} = \bigcup_{i=1}^{n} \{e_i\}$. Let
$\{e_i^{(\alpha_i)} \mid 1 \le i \le n\}$ be the set of most
differentiated equations in M. The highest derivative of a
non-differentiated variable x in the model M is defined as
$\max(\{i \mid x^{(i)} \in \mathrm{var}_{X_u}(M)\})$.

Pantelides’  algorithm  differentiates  equation  subsets,  so  that  the 
original  equations  together  with  the  differentiated  equations  have  a 
complete  matching  [4]  of  the  most  differentiated  equations  into  the 
unknown  variables  with  the  highest  derivatives. 

The  modification  of  Pantelides’  algorithm  is  that  derivatives  of 
known  variables,  higher  or  equal  to  m(y),  are  also  allowed  to  be 
included  in  the  matching. 

Algorithm  2 

Input: The original model Morig, a description of which variables
are linearly contained, and, for each
$y \in \overline{\mathrm{var}}_Y(M_{orig})$, a limit $m(y) < \infty$.

(1) Apply the modified Pantelides’ algorithm to Morig and the limits
m(y). The output is the number of times each equation must be
differentiated to find all MSS sets.

(2) Differentiate the equations in Morig the number of times
suggested in step (1) and use the description of which variables
are linearly contained to get the correct structural description of
the differentiated structural model, denoted Mdiff.

Output:  Mdiff. 

It is critical that step (1) in Algorithm 2 terminates, i.e. no
equation should be differentiated an infinite number of times. In
Pantelides (1988) the condition under which the algorithm terminates
is stated. This condition can be written as the structural property
(3). Since the model Morig has this property according to Assumption 1,
the algorithm will terminate.

Let now MSS(M) denote the set of MSS sets found in equations
M and $MSS_{all}(M) = MSS(\bigcup_{i=0}^{\infty} M^{(i)})$. Then it is
possible to state the following theorem, proven in [7].

Theorem 2 If Assumption 1 is satisfied and for each
$y \in \overline{\mathrm{var}}_Y(M_{orig})$, $m(y) < \infty$, then

$$MSS_{all}(M_{orig}) = MSS(M_{diff})$$

The consequence of this theorem is that all MSS sets that are possible
to find if the original model Morig is differentiated an infinite
number of times can always be found in Mdiff.


Example 4 The following example is a continuation of Example 1
with the structural model shown in (1). Let m(u) = m(yf) = 1
and m(yh) = 0. According to Algorithm 1 the first iteration uses
the fault-free model, i.e. all faults are zero. The equation e6
contains only a fault. Since all faults are at the moment assumed to be
zero, e6 is not considered. Further, assume that no variable is
linearly contained in any equation. Then no variable will disappear in
the differentiation. The structural model Mdiff obtained from
Algorithm 2

4.2 Simplifying the Model

It is a complex task to find all MSS sets in a structural model.
Therefore it can be of great help if it is possible to simplify the
model. Here two kinds of simplifications are used.

In a first step, all equations in Mdiff that include any variable
that is impossible to eliminate are removed. This can be done with
Canonical Decomposition [2].

In a second step, variables that can be eliminated without losing
any structural information are found. The rest of this section will be
devoted to a discussion about this second step.

If there is a set $X \subseteq X_u$ with the property
$1 + |X| = |\mathrm{equ}_{M_{diff}}(X)|$, then all equations in
$\mathrm{equ}_{M_{diff}}(X)$ have to be used to eliminate all
variables in X. Since all unknown variables must be eliminated in an
MSS set, this means in particular that every MSS set including any
equation of $\mathrm{equ}_{M_{diff}}(X)$ has to include all equations
in $\mathrm{equ}_{M_{diff}}(X)$. The idea is to find these sets. Then
it is possible to eliminate internal variables, here denoted X, in
these sets. Every set is replaced with one new equation.

This second simplification step finds subsets of variables that are
included in exactly one more equation than the number of variables.
To reduce the computational complexity, a complete search for such
sets is in fact not performed here. Instead only a search for single
variables included in two equations is done. When a variable is
included in just two equations, these equations are used to eliminate
the variable. If all variables are examined and some simplification
was possible, then all remaining variables have to be examined once
more. When no more simplifications can be made, the simplification
step is finished and the resulting structural model is denoted Msimp.
Note that with this strategy sets larger than two equations will also
be found, since the algorithm can merge sets found in previous steps.

The next theorem ensures that no MSS set is lost in the
simplification step.

Theorem 3 $MSS(M_{diff}) = MSS(M_{simp})$

For a proof, see [7]. Consider again Example 4 and the output (5)
from the differentiation step. No equations can be removed in the
first simplification step.

The second step searches for variables which belong to only two
equations. In the first search, the algorithm finds F1 in {e1, e2},
F2 in {e4, e5}, and h in the equations produced by {e1, e2} and
{e4, e5}. This makes one group of {e1, e2, e4, e5}. This search made
simplifications and therefore the search is performed once more. The
second time no simplifications are made and the simplification step is
therefore complete. The remaining system is

4.3 Finding MSS Sets

After the simplification step is completed, step (3) in Algorithm 1
finds all MSS sets in the simplified model Msimp. This section
explains how the MSS sets are found.

The task is to find all MSS sets in the model Msimp with equations
{e1, ..., en}. Let Mk = {ek, ..., en} be the last n - k + 1
equations. Let E be the current set of equations that is examined. The
set of MSS sets found is denoted Malg3. Then the following algorithm
finds all MSS sets in Msimp.


Algorithm  3 

Input: The model Msimp.

1. Set k = 1 and $M_{alg3} = \emptyset$.

2. Choose equation ek. Let E = {ek} and $X = \emptyset$.

3. Find all MSS sets that are subsets of Mk and include equation ek.

(a) Let $\tilde{X} = \mathrm{var}_{X_u}(E) \setminus X$ be the
unmatched variables.

(b) If $\tilde{X} = \emptyset$, then E is an MSS set. Insert E into
Malg3.

(c) Else take a remaining variable $x \in \tilde{X}$ and let
$X = X \cup \{x\}$. Let $\tilde{E} = \mathrm{equ}_{M_k \setminus E}(x)$
be the remaining equations. For all equations e in $\tilde{E}$, let
$E = E \cup \{e\}$ and go to (a).

4. If k < n, set k = k + 1 and go to step (2).

Output: The set of MSS sets found, i.e. Malg3.

Algorithm 3 finds all MSS sets in Msimp according to the next
theorem, proven in [7].

Theorem 4 $M_{alg3} = MSS(M_{simp})$
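
The search in Algorithm 3 can be sketched recursively as
follows. This is our reading of steps 1 to 4, not the authors'
implementation: the model is assumed to be an ordered mapping
from equation names to the sets of unknown variables they
contain, the remaining variable in step (c) is picked as the
smallest name, and duplicates (the same set reached through
different matchings) are removed by collecting frozensets. The
example model and the name find_mss_sets are hypothetical.

```python
def find_mss_sets(model):
    """model: dict equation name -> set of unknown variables, ordered e_1..e_n."""
    names = list(model)
    found = set()

    def extend(current, matched, allowed):
        # Step (a): unmatched variables of the current equation set.
        unmatched = set().union(*(model[e] for e in current)) - matched
        if not unmatched:
            # Step (b): every variable is matched, so `current` is recorded.
            found.add(frozenset(current))
            return
        # Step (c): take one remaining variable and try every remaining
        # equation in M_k that contains it.
        x = min(unmatched)
        for e in allowed:
            if e not in current and x in model[e]:
                extend(current | {e}, matched | {x}, allowed)

    # Steps 2 and 4: start from each e_k and search only within M_k.
    for k, e_k in enumerate(names):
        extend({e_k}, set(), names[k:])
    return found


model = {"e1": {"x1"}, "e2": {"x1", "x2"}, "e3": {"x2"}, "e4": {"x2"}}
mss_sets = find_mss_sets(model)
for mss in sorted(sorted(s) for s in mss_sets):
    print(mss)
```

Restricting the search for ek to Mk avoids revisiting sets whose
lowest-indexed equation has already been used as a starting
point.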


The following small example with five equations shows how the
algorithm works.

[Incidence matrix of the five-equation example model]

This model gives the following time evolution of current equations,
i.e. E in Algorithm 3 is

The bold columns represent the MSS sets found. This example
also shows that if there are several matchings including the same
equations, the algorithm finds the same subset of equations several
times.


4.4  Analyzing  Diagnosability 

When the MSS sets are found, the next step is to analyze their diagnosability. The continuation of the example in (6) will be used to illustrate how this analysis is done. The 4 MSS sets that can be found in (6) are shown in the left column in Figure 1 (a). The matrix in this figure is the incidence matrix of the MSS sets in (6). If any equation in MSS set i includes fault j, element (i, j) of the incidence matrix is equal to X. Note that an X in position (i, j) does not guarantee that fault j appears in MSS set i. For an example of the interpretation of an incidence matrix, consider the third MSS set in Figure 1 (a). This MSS set could contain $f_u$ and $f_{yf}$, but it cannot contain $f_{yh}$, since $f_{yh}$ is only included in equation $e_3$. For simplicity, the derivatives of the faults are omitted in Figure 1.

If the number of different faults is large, it is not easy to see which faults can be isolated from each other. The incidence matrix of the MSS sets shows which faults could be responsible for an inconsistency of each MSS set, but it is more interesting to see which faults can be explained by other faults. A fault matrix shows the maximum isolation and detection capability of the diagnosis system. The maximum isolation capability of a diagnosis system designed with this structural method is obtained if it is assumed that each fault makes all MSS sets including this fault inconsistent. If every MSS set that is sensitive to fault i is also sensitive to fault j, then element (i, j) of the fault matrix is equal to X. The interpretation of an X in position (i, j) is that fault $f_i$ cannot be isolated from fault $f_j$.

The fault matrix corresponding to the incidence matrix in Figure 1 (a) is shown in Figure 1 (b). Consider the first row of the fault matrix. Suppose that fault $f_u$ is present. Then the first three MSS sets are not satisfied in an ideal case. This means that $f_u$ certainly can explain fault $f_u$, but $f_{yf}$ can also explain fault $f_u$. Fault $f_{yh}$ cannot explain fault $f_u$, since if $f_{yh}$ is present, the third MSS set is satisfied. Note that the fault matrix is not symmetric. For example, fault $f_{yf}$ can explain fault $f_u$ but the opposite is not true. The fault matrix can more easily be analyzed after Dulmage-Mendelsohn permutations [8]. This algorithm returns a maximal matching [4] which is in block upper-triangular form. The diagonal blocks correspond to strong Hall components of the adjacency graph of the fault matrix. The interpretation is that faults in a diagonal block can never be distinguished with that diagnosis system. In the small example in Figure 1 (b), the same matrix is returned after Dulmage-Mendelsohn permutations, which usually is not the case. The diagonal blocks are the 1 x 1 diagonal elements.
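
The construction of the fault matrix from the incidence matrix can be illustrated with a short Python sketch. The incidence data below are hypothetical: Figure 1 is not reproduced here, so the rows were chosen only to agree with the relations stated in the text ($f_{yf}$ can explain $f_u$, $f_{yh}$ cannot). The function name `fault_matrix` is likewise an assumption.

```python
def fault_matrix(incidence):
    """Return, for each fault i, the set of faults j with an X in position (i, j).

    incidence is a list with one entry per MSS set; each entry is the set of
    faults that may affect that MSS set (one row of the incidence matrix).
    """
    faults = sorted({f for row in incidence for f in row})
    # For each fault, the indices of the MSS sets that are sensitive to it.
    sens = {f: {i for i, row in enumerate(incidence) if f in row} for f in faults}
    # X in (i, j): every MSS set sensitive to fault i is also sensitive to fault j.
    return {fi: {fj for fj in faults if sens[fi] <= sens[fj]} for fi in faults}


# Hypothetical incidence matrix with four MSS sets and three faults.
incidence = [
    {"f_u", "f_yh", "f_yf"},
    {"f_u", "f_yh", "f_yf"},
    {"f_u", "f_yf"},
    {"f_yh", "f_yf"},
]
fm = fault_matrix(incidence)
```

With these rows, `fm["f_u"]` contains `f_u` and `f_yf` but not `f_yh`, and `fm["f_yf"]` contains only `f_yf`, which shows that the matrix need not be symmetric.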


(a)  (b) 

Figure  1.  The  incidence  matrix  of  MSS  sets  is  shown  in  (a).  The  fault 
matrix  of  (a)  is  shown  in  (b). 


4.5  Decoupling  faults 

Suppose that the element (i, j) of the fault matrix is equal to X for some $i \neq j$. It could still be possible to isolate fault i from fault j by trying to decouple fault j. Include fault j among the unknown variables $X_u$ and search for new MSS sets by applying Algorithm 1 step (1) to the new model obtained. An MSS set that is able to isolate fault i from fault j has to include at least one equation that includes fault i. If any such MSS set is found, it has to include an elimination of fault j. If not, this MSS set would have been discovered earlier.

In the example in Figure 1, the fault matrix shows that $f_u$ and $f_{yh}$ cannot be isolated from $f_{yf}$. The problem is that there is no MSS set that decouples fault $f_{yf}$. But there could be one if $f_{yf}$ is eliminated. The fault $f_{yf}$ is moved from the faults $F$ to the unknown variables $X_u$. The procedure starts all over from step (1) in Algorithm 1. The result is a new MSS set in which $f_{yf}$ is decoupled. This makes it possible to detect and isolate all faults.


4.6  Selecting  a  Subset  of  MSS  Sets 

It is not unusual that the number of MSS sets found is very large. Many of the MSS sets probably use almost as many equations as there are unknown variables in the entire system. These MSS sets usually rely on too many uncertainties to be usable for fault isolation. Small MSS sets are more robust and are usually sensitive to fewer faults. Therefore the goal must be to find the set of most robust MSS sets that still has the same diagnosis capability as the set of all MSS sets.

Start by sorting the MSS sets in ascending order of complexity. The complexity measure is here the number of equations, even though more informative measures are also a possibility. The MSS sets are examined in the rearranged order. If an MSS set increases the diagnosability, then select the MSS set. The diagnosability is increased if some fault becomes detectable or some fault i can be isolated from some other fault j. This means that for each detection of a fault and for each isolation between two faults, the smallest MSS set with this diagnosis ability will be one of the chosen MSS sets. In this way the final output from Algorithm 1 will be the most robust set of MSS sets with the highest possible diagnosis capability.
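
The selection procedure can be sketched as a greedy pass over the sorted MSS sets. This is an illustrative sketch only: each candidate is assumed to be a pair of an equation set and the set of faults it is structurally sensitive to, an MSS set is assumed to isolate fault i from fault j when it is sensitive to i but not to j, and the name `select_mss_sets` and the data are hypothetical.

```python
def select_mss_sets(candidates):
    """Greedily keep the smallest MSS sets that add detection or isolation.

    candidates is a list of (equations, faults) pairs, where faults is the set
    of faults the MSS set is structurally sensitive to.
    """
    all_faults = set().union(*(faults for _, faults in candidates))
    detected = set()   # faults detectable by the sets selected so far
    isolated = set()   # pairs (i, j): fault i can be isolated from fault j
    selected = []
    # Complexity measure: number of equations (ascending order).
    for equations, faults in sorted(candidates, key=lambda c: len(c[0])):
        new_detected = faults - detected
        pairs = {(i, j) for i in faults for j in all_faults - faults}
        new_isolated = pairs - isolated
        if new_detected or new_isolated:
            selected.append((equations, faults))
            detected |= faults
            isolated |= pairs
    return selected


# Hypothetical candidates: only the first two add diagnosability.
candidates = [
    ({1, 2, 3, 4}, {"a", "b"}),
    ({3, 4, 5}, {"a"}),
    ({1, 2}, {"a", "b"}),
    ({6, 7, 8, 9, 10}, {"a"}),
]
chosen = select_mss_sets(candidates)
```

In this example the two-equation set is chosen because it detects both faults, and the three-equation set is chosen because it isolates fault `a` from fault `b`; the larger sets add nothing new and are skipped.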


5  Industrial  example:  A  part  of  a  paper  plant 

This  example  is  a  stock  preparation  and  broke  treatment  system  of  a 
paper  plant  located  in  Australia.  The  system  is  used  for  mixing  and 
purifying  recycled  paper  for  production  of  new  paper.  An  overview 
of  the  system  is  shown  in  Figure  2. 




Figure  2.  A  stock  preparation  and  broke  treatment  system  of  a  paper  plant. 


5.1 System Description

Most parts of the system are nonlinear, and only the tank and the pulper are considered to be dynamic. The model has been shown to compare well with real measured data. Because of space considerations, the details of the model are omitted, but can be found in [7]. The system has 4 states: the volume and concentration in the pulper and in the tank. There are 6 sensors in the system. Sensors $y_1$ and $y_3$ measure the water levels of the pulper and the tank respectively, $y_2$ and $y_4$ measure concentration, and $y_5$ and $y_6$ measure pressure. The flows and concentrations into this system are known and the flows out from the system are also known. There are 6 valves and two pumps that are actuators with known inputs.

There are 21 faults that are considered. All sensors can have a constant offset fault ($f_1, \ldots, f_6$). All valves can have a constant offset in the actuator signal ($f_7, \ldots, f_{12}$). Clogging can occur in the pipes near the valves ($f_{13}, \ldots, f_{18}$) and also directly after the tank ($f_{19}$). Finally, the pumps can have a constant offset in the actuator signal ($f_{20}, f_{21}$).

The system is described by 29 equations. Equations ($e_1, \ldots, e_4$) describe the dynamics, ($e_5, \ldots, e_{14}$) are pressure loops, $e_{15}$ relates the concentration in the junction after the tank with the flows $F_4$ and $F_6$, ($e_{16}, e_{17}$) describe the two pumps, ($e_{18}, \ldots, e_{23}$) are valve equations, ($e_{24}, \ldots, e_{26}$) are flow equations, and finally ($e_{27}, \ldots, e_{29}$) are sensor equations for sensors 1, 2, and 3. The structural model for these equations can be viewed in the first 29 rows in the matrices in Figure 3.


5.2 Differentiating the Model

The highest order of derivatives that is known for all known variables is assumed to be one. If a variable is contained linearly in an equation, the variable disappears in the differentiated expression. This knowledge is used since the equations are known. Algorithm 2 is applied to the first 29 equations in Figure 3. The result is that all equations except equations 1, 2, 3, and 4 are differentiated. This results in an additional 25 differentiated equations, shown in the lower part of Figure 3.


[Figure 3 column groups: unknown variables, faults]

Figure 3. Structural model of the stock preparation and broke treatment system.


5.3 Simplifying the Model

In the first step of simplification applied to the left matrix in Figure 3, the equations {27, 28, 29} include variables belonging only to one equation, i.e. they cannot be included in any MSS sets.

The second part of the simplification finds that the variables {9, 17, 18, 19, 20, 21, 25, 26, 27, 28, 29, 30, 31} can be eliminated. The equations that form groups are {1, 52}, {2, 53}, {3, 54}, {4, 15, 40}, {32, 41, 44}, {39, 48, 51}, {31, 43}, {35, 45}, {37, 46} and {36, 47}. The simplified structural model is shown in Figure 4 (a). Note the simplification of the model by comparing Figure 3 and Figure 4 (a).


Figure 4. The simplified structural model is shown in (a). The incidence matrix of the MSS sets is shown in (b).


5.4  Finding  MSS  sets 

Algorithm  3  is  then  applied  to  the  simplified  model.  The  algorithm 
returns  35770  MSS  sets  that  are  contained  in  the  simplified  model. 
The  largest  MSS  set  consists  of  24  equations. 

5.5  Analyzing  Diagnosability 

The two different fault matrices are shown in Figure 5. The Dulmage-Mendelsohn permutation shows that the faults {7, 13}, {8, 14}, {9, 15}, {10, 16}, {11, 17} and {12, 18} are never distinguishable. The two faults in each of these pairs belong to the same valve. This isolation performance for faults concerning valves is in this case acceptable. To give an example of how elimination of faults is done, attention is focused on isolating faults 4, 8, and 14.

5.6  Decoupling  faults 

Considering Figure 5, it is still important to discover if any MSS set can decouple fault 2 or 3 and be sensitive to fault 4. It is also necessary to decouple fault 20. Apply Algorithm 1 to the original model, but where fault 2 now is considered to be an unknown variable. Then apply Algorithm 1 to the model where fault 3 is decoupled, and finally to the model where fault 20 is decoupled. The algorithm thereby finds additional MSS sets that isolate faults 4, 8, and 14.


5.7  Selecting  a  subset  of  MSS  sets 

The 24 chosen MSS sets are

MSS   Equations
 1    13
 2    2 53
 3    6 18
 4    11 22
 5    1 16 52
 6    22 36 47
 7    7 16 19
 8    8 9 17 24
 9    9 10 17 20
10    12 17 21 25
11    16 19 32 41 44
12    8 10 17 20 24
13    12 14 21 23 26
14    14 17 23 25 26
15    17 24 33 34 42 49
16    7 16 19 32 41 44
17    17 21 25 37 42 46 50
18    8 10 12 20 21 24 25
19    17 23 25 26 39 42 48 50 51
20    3 4 15 16 17 24 40 42 49 54
21    1 3 4 15 17 24 40 42 49 52 54
22    3 4 8 10 15 16 20 33 35 40 45 54
23    2 3 4 15 16 17 24 40 42 49 53 54
24    3 4 8 9 15 16 17 24 40 42 49 54
                                                  (7)

From  these  sets  and  the  structural  model  in  Figure  3  the  incidence 
matrix  in  Figure  4  (b)  is  obtained. 


Figure  5.  These  matrices  are  the  fault  matrices  before  (a)  and  after  (b)  the 
Dulmage-Mendelsohn  permutation. 


5.8  Generating  Consistency  Relations 

Consistency relations corresponding to the 24 MSS sets are calculated by using the function Eliminate in Mathematica. Most of the equations in the model are polynomial equations. For polynomial equation systems, the function Eliminate uses Gröbner basis techniques for elimination. Each MSS set with 7 or fewer equations was easily eliminated to a consistency relation. The consistency relations from MSS sets 17 and 18 were obtained from the Eliminate function, but were too complex to be numerically reliable. Elimination of the unknown variables in MSS sets with 8 or more equations was computationally intractable with the Eliminate function. Therefore, by using only consistency relations obtained from the 15 first MSS sets, the isolation capability was reduced slightly. Some further results of the investigation can be found in [7].


6 Conclusion

This paper has presented a systematic and automatic method for finding a small set of submodels that can be used to derive consistency relations with the highest possible diagnosis capability. The method is based on graph theoretical reasoning about the structure of the model. It is assumed that a condition on algebraic independency is fulfilled.

An important idea, towards finding these submodels, is to use the mathematical concept of minimal structurally singular sets. These sets have in Theorem 1 been shown to characterize these submodels, i.e. the consistency relations, which give the fault detection and the fault isolation capability.

The method is capable of handling general differential-algebraic non-causal equations. Further, the method is not limited to any special type of fault model. Algorithm 1 finds all submodels that can be used to derive consistency relations, and this is proven in Theorems 2, 3, and 4. The key step in Algorithm 1 is step (3), which finds all MSS sets in the model it is applied to.

Finally, the method has been applied to a large nonlinear industrial example, a part of a paper plant. The algorithm successfully managed to derive a small set of submodels. In spite of the complexity of this process, a sufficient number of submodels could be transformed to consistency relations so that high diagnosis capability was obtained.


Abstract. We consider multilevel set-covering models for diagnostic reasoning: though a lot of work has been done in this field, knowledge acquisition efforts have been investigated only insufficiently. We will show how set-covering models can be built incrementally and how they can be refined by knowledge enhancements or representational extensions. All these extensions have a primary characteristic: they can be applied without changing the basic semantics of the model.

1 Introduction

In this paper we will present a new interpretation of set-covering models [1] which is a suitable representation for the manual development of knowledge-based systems. Because of their simple semantics, set-covering models are rapidly understood by the experts, but still maintain a well-known model-based interpretation. In [2] we showed how knowledge-based diagnostic systems can be developed incrementally with set-covering models, thus supporting rapid prototyping of such systems. In this paper we will extend this approach to multilevel set-covering models, and we will describe how simple set-covering models can be enhanced by representational extensions. Practical experience has shown that these additions facilitated the development of a real world example from a medical ICU domain.

A  set-covering  model  consists  of  a  set  of  diagnoses,  a  set  of  find¬ 
ings  (observations)  and  covering  relations  between  the  elements  of 
these  two  sets.  There  exists  a  covering  relation  between  a  diagnosis 
and  a  finding,  iff  the  diagnosis  implies  the  observation  of  the  find¬ 
ing.  We  can  define  covering  relations  between  diagnoses  as  well,  iff 
a  diagnosis  implies  the  observation  of  another  diagnosis.  The  basic 
idea  of  set-covering  diagnosis  is  the  detection  of  a  reasonable  set  of 
diagnoses  which  can  explain  the  given  observations.  To  do  this,  we 
propose  an  abductive  reasoning  step:  Firstly,  hypotheses  are  gener¬ 
ated  in  order  to  explain  the  given  observations.  Secondly  competing 
hypotheses  are  ranked  using  a  quality  measure. 

Reasoning with set-covering models has a long tradition in diagnostic reasoning: Early work was done by Patil [3] with his system ABEL, which implemented a comprehensive set-covering representation including causal, associational and grouping relations. Reggia et al. [1] contributed a formal approach to set-covering models and addressed the problem of hypothesis generation with a precise description of generator sets. Later [4] they introduced the integration of Bayesian probabilities in set-covering models. With the system MOLE [5], Eshelman focussed on the problem of acquiring set-covering knowledge. He proposed an interactive process that allows for refining previously acquired knowledge after a reasoning step to differentiate between conflicting hypotheses. Console et al. [6] showed with the system CHECK how to combine heuristic and causal knowledge. There, heuristic knowledge was used to find reasonable hypotheses for a given observation. In a second step the causal knowledge was used to generate abductive explanations for the hypotheses. Long [7] extended covering models with probabilities and a rich syntax of temporal and non-temporal causation events. Since knowledge acquisition is a cost sensitive task, reuse of existing knowledge is another emerging aspect. Puppe [8] showed how set-covering knowledge can be combined with other classes of knowledge like heuristic rules, case-based knowledge or decision trees.

Most of these approaches only investigated syntax and semantics of the reasoning process, but did not consider the knowledge engineering process. Eshelman's MOLE system [5] differs from our knowledge acquisition approach, since there knowledge refinement is performed by adding new covering relations to the model. In our paper we will present (multilevel) set-covering models and show how to enrich these simple models with knowledge enhancements like similarities and weights or representational extensions for more complex covering relations. A primary characteristic of the presented extensions is their incrementality: each extension can be applied independently from other enhancements and will not change the basic semantics of the model, but refine special aspects of it.

The rest of the paper is organized as follows: In Section 2 we will introduce the basic concepts of set-covering models and show how to enrich set-covering models with additional knowledge like similarities and weights. Beyond that we will introduce representational extensions of set-covering models in Section 3 that will enable us to formulate exclusions, necessary relations and complex covering relations (conjunctions, disjunctions, cardinalities). In Section 4 we will briefly summarize the problem of hypothesis generation and we will introduce constraints that shrink the exponential size of the set of possible hypotheses. We will conclude this paper in Section 5 with an overview of the work we have done so far and promising directions we are planning to work on in the future.

2  Set-Covering  Models 

A set-covering model consists of a set of diagnoses, a set of findings (observations) and covering relations between the elements of these two sets. There exists a covering relation between a diagnosis and a finding, iff the diagnosis predicts the observation of the finding. Furthermore we can define covering relations between two diagnoses to state that a diagnosis implies another diagnosis. In this way we can build a covering-tree for a diagnosis, where we postulate that the leaves of the covering-tree have to be observable findings. So each covering path will start with a diagnosis and lead to an observable finding.

2.1 The Basic Model

The basic idea of set-covering diagnosis is the detection of a reasonable set of diagnoses which can explain the given observation of findings. In an abductive reasoning step, hypotheses are first generated in order to explain the given observations (hypothesis generation). In a second step, we define a quality measure for ranking competing hypotheses (hypothesis testing). Set-covering models describe relations like:

A diagnosis $D$ predicts that the parameters $A_1, \ldots, A_n$ are observed with corresponding values $v_1, \ldots, v_n$.

A diagnosis $D$ predicts the diagnoses $D_1, \ldots, D_m$.

We call each of these relations covering relations and we denote them by

$r_i = D \rightarrow A_i{:}v_i, \quad 1 \le i \le n,$
$r'_i = D \rightarrow D_i, \quad 1 \le i \le m.$

Covering models can be visually described as in Figure 1. In this example the model states that diagnosis Flu implies the observation of the diagnoses Fever and Cold. Diagnosis Fever itself forces the observation of the attributes Temperature and Skin with their corresponding values Increased and Sweating.


Figure 1. Basic set-covering model for diagnoses Flu, Fever and Cold.


The basic algorithm for set-covering diagnosis is very simple: Given a set of observed findings, it uses a simple hypothesize-and-test strategy, which generates hypotheses (coined from diagnoses) in the first step and tests them against the given observations in a second step. The test is defined by calculating a quality measure, which expresses the covering degree of the hypothesis regarding the observed findings. The generation and evaluation of the hypotheses is an iterative process, which stops when a satisfying hypothesis has been found or all hypotheses have been considered. Usually the algorithm will look at single diagnoses, compute the corresponding quality measure, and then it will generate hypotheses with multiple diagnoses, if needed.

In the worst case this procedure will generate $2^n$ candidates for $n$ diagnoses. So heuristics are needed to keep the method computationally tractable (cf. Section 4).

The basic sets for this task are the following: We define $\Omega_D$ to be the set of all diagnoses and $\Omega_A$ the set of all observable parameters (attributes). To each parameter $A \in \Omega_A$ a range $dom(A)$ of values is assigned, and $\Omega_V = \bigcup_{A \in \Omega_A} dom(A)$ is the set of all possible values for the parameters. If a parameter $A$ is assigned to a value $v$, then we call $A{:}v$ a finding.

$\Omega_F = \{\, A{:}v \mid A \in \Omega_A,\ v \in dom(A) \,\}$

is the set of all findings. Furthermore we call an element $S \in \Omega_S = \Omega_D \cup \Omega_F$ a state.

A covering relation $r$ between a diagnosis $D$ and a state $S$ ($S \neq D$) is denoted by $r = D \rightarrow S$. We say that "$D$ predicts $S$" or that "$D$ covers $S$". Then $c_r = D$ is called the cause and $e_r = S$ is called the effect. We define $\Omega_R$ to be the set of all covering relations contained in the model. Then $D^+ \subseteq \Omega_R$ is the set of all covering relations with diagnosis $D$ as the cause, i.e. $D^+ = \{\, r \in \Omega_R \mid c_r = D \,\}$. E.g., for the model in Figure 1 we obtain $c_{r_1} = \text{Flu}$ and $e_{r_1} = \text{Fever}$, $\text{Cold}^+ = \{r_5, r_6\}$.

Since $S$ can be a diagnosis itself, we are able to build multilevel set-covering models. A state $S$ transitively covers another state $S'$, if either $S$ covers $S'$ or $S$ covers another state $S''$ that transitively covers $S'$.


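We call $\mathcal{F}_O \subseteq \Omega_F$ the set of observed findings and a set $\mathcal{H} \subseteq \Omega_D$ of diagnoses a hypothesis. A finding that is not transitively covered by the hypothesis $\mathcal{H}$ is called isolated, and the set of all observed findings that are isolated will be denoted by $\mathcal{F}^{isolated}_{\mathcal{H},O} \subseteq \mathcal{F}_O$. E.g., for a hypothesis $\mathcal{H} = \{D_1\}$ and $\mathcal{F}_O = \{A_1{:}v_1, A_2{:}v_2, A_4{:}v_4\}$ we obtain $\mathcal{F}^{isolated}_{\mathcal{H},O} = \{A_2{:}v_2\}$.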


Figure  2.  Basic  set-covering  model  for  diagnosis  D 


Now we will describe the computation of the precision of a state for a given observation. The precision $\pi(S)$ of a state $S$ provides a real value between 0 and 1 to describe the degree of accuracy with which the covered states of $S$ are observed.

Bottom-Up Computation of Precisions. Given the set $\mathcal{F}_O$ of observed findings, the precision $\pi$ of each state is computed bottom-up starting with the findings:

$$\pi(A{:}v) = \begin{cases} 1, & \text{if } A{:}v \in \mathcal{F}_O \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

The precision $\pi(D)$ of a diagnosis $D$ can be computed as soon as the precisions of all its successors $S$ are known. For this we define

$D^+_{\ge c} = \{\, r \in D^+ \mid \pi(e_r) \ge c(e_r) \,\},$

$D^+_{>0} = \{\, r \in D^+ \mid \pi(e_r) > 0 \,\},$

as the sets of all relevant covering relations, i.e. relations that predict states with a precision greater than a user defined threshold function.

$$\pi(D) = \begin{cases} \dfrac{\sum_{r \in D^+_{\ge c}} \pi(e_r)}{|D^+_{>0}|}, & \text{if } D^+_{>0} \neq \emptyset \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

The denominator counts all successor states of $D$ with a positive precision, which gives us the maximally achievable score. The numerator sums up the precision of all successor states with a precision that is greater than or equal to the completeness value, which gives us the actually achieved score.

The  completeness  value  c(D)  of  a  diagnosis  is  specified  by  the  mod¬ 
eler  and  is  motivated  by  the  fact,  that  a  covering  model  for  a  diagnosis 
will  contain  more  states  than  the  diagnosis  will  cause  in  an  average 
case.  Nevertheless  in  most  cases  the  observation  of  a  percentage  of 
the  modeled  states  will  legitimate  the  validation  of  this  diagnosis.  To 
emphasize  this  percentage  the  modeler  has  to  specify  a  completeness 
value  c(D).  Unless  this  factor  is  reached  by  the  observation  set  in 
the  current  case,  the  diagnosis  may  neither  be  considered  as  a  validly 
observed  state,  nor  will  it  be  considered  as  a  valid  hypothesis  candi¬ 
date. 

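Since we also want to consider multiple faults, i.e. hypotheses containing more than one diagnosis, we define

$$\mathcal{H}^+ = \bigcup_{D \in \mathcal{H}} D^+, \qquad \mathcal{H}^+_{>0} = \bigcup_{D \in \mathcal{H}} D^+_{>0}, \qquad \mathcal{H}^+_{\ge c} = \bigcup_{D \in \mathcal{H}} D^+_{\ge c}.$$

The covering relations $r \in \mathcal{H}^+_{\ge c}$ are called relevant for $\mathcal{H}$. Observe that relevancy depends on $\mathcal{F}_O$, since the precisions have been computed based on $\mathcal{F}_O$.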

Quality Measures. The quality measures are used to rank the possible hypotheses with respect to the given observation. As we already introduced the precision of a single diagnosis, we now will define the quality of a hypothesis, which can contain multiple diagnoses. The quality of a hypothesis provides a real value between 0 and 1 to describe the degree of accuracy with which the hypothesis $\mathcal{H}$ can explain the given observation $\mathcal{F}_O$.

Definition 2.1 (Quality Measure) The quality $\varrho(\mathcal{H})$ of a hypothesis $\mathcal{H}$ is given by

$$\varrho(\mathcal{H}) = \frac{\sum_{r \in \mathcal{H}^+_{\ge c}} \pi(e_r)}{|\mathcal{H}^+_{>0}| + |\mathcal{F}^{isolated}_{\mathcal{H},O}|}.$$

Notice  that,  in  contrast  to  the  precision,  the  quality  measure  does  not 
evaluate  a  single  diagnosis  with  respect  to  the  transitively  observed 
predictions,  but  assesses  a  hypothesis  (containing  possibly  multiple 
diagnoses)  on  the  basis  of  the  transitively  predicted  and  observed 
findings  and  the  unexplained  (isolated)  findings. 
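
The bottom-up precision of Equations (1) and (2) and the quality measure of Definition 2.1 can be traced in a small Python sketch. It rests on several assumptions that are not in the text: the model is a hypothetical dictionary from each diagnosis to the states it covers, findings are plain strings of the form `Attribute:value`, findings without an explicit completeness value use a threshold of 1.0, and a hypothesis with an empty denominator is given quality 0. The findings listed for Cold and the Headache observation are invented for illustration.

```python
def precision(state, model, observed, c):
    """Precision of a state, computed bottom-up from the observed findings."""
    if state not in model:                      # a finding, Equation (1)
        return 1.0 if state in observed else 0.0
    p = {e: precision(e, model, observed, c) for e in model[state]}
    positive = [e for e in p if p[e] > 0]       # D+ restricted to precision > 0
    if not positive:
        return 0.0
    relevant = [e for e in p if p[e] >= c.get(e, 1.0)]
    return sum(p[e] for e in relevant) / len(positive)   # Equation (2)


def covered(hypothesis, model):
    """All states transitively covered by the diagnoses in the hypothesis."""
    seen, stack = set(), list(hypothesis)
    while stack:
        for effect in model.get(stack.pop(), ()):
            if effect not in seen:
                seen.add(effect)
                stack.append(effect)
    return seen


def quality(hypothesis, model, observed, c):
    """Quality of a hypothesis following Definition 2.1."""
    relations = {(d, e) for d in hypothesis for e in model[d]}
    p = {r: precision(r[1], model, observed, c) for r in relations}
    achieved = sum(v for r, v in p.items() if v >= c.get(r[1], 1.0))
    positive = sum(1 for v in p.values() if v > 0)
    isolated = observed - covered(hypothesis, model)
    denominator = positive + len(isolated)
    return achieved / denominator if denominator else 0.0   # assumption for 0/0


# Hypothetical model in the spirit of Figure 1.
model = {
    "Flu": ["Fever", "Cold"],
    "Fever": ["Temperature:Increased", "Skin:Sweating"],
    "Cold": ["Nose:Running", "Throat:Sore"],
}
completeness = {"Fever": 0.7, "Cold": 0.7}
observed = {"Temperature:Increased", "Skin:Sweating", "Nose:Running", "Headache:Yes"}

q_flu = quality({"Flu"}, model, observed, completeness)
q_fever = quality({"Fever"}, model, observed, completeness)
```

Here the hypothesis {Flu} leaves only the Headache finding isolated, while {Fever} also leaves the Nose finding unexplained, so {Flu} is ranked higher.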

We see that $\varrho(\mathcal{H}) \in [0, 1]$ for any hypothesis $\mathcal{H} \subseteq \Omega_D$. The lower bound 0 is obtained if $\mathcal{H}^+_{\ge c} = \emptyset$. The upper bound 1 is obtained if all predictions are fully observed, i.e. $\mathcal{H}^+_{\ge c} = \mathcal{H}^+_{>0}$, and the set $\mathcal{F}^{isolated}_{\mathcal{H},O} = \emptyset$.

Example. For the covering relation given in Figure 2, the set

$\mathcal{F}_O = \{\, A_2{:}v_2,\ A_3{:}v_3,\ A_4{:}v_4,\ A_5{:}v_5,\ A_6{:}v_6 \,\}$

of findings, and the hypothesis $\mathcal{H} = \{D_1\}$, we obtain $\pi(D_2) = 1$, $\pi(D_3) = 1$ (with $c(D_2) = c(D_3) = 0.7$). Since we obtain $\mathcal{H}^+ = \{r_1, r_2, r_3\}$ for hypothesis $\mathcal{H}$ we can calculate

$\mathcal{H}^+_{\ge c} = \{r_1, r_2\},$

$\mathcal{F}^{isolated}_{\mathcal{H},O} = \{A_2{:}v_2,\ A_3{:}v_3\}.$

Up to now we presented the basic representation for set-covering models containing diagnoses and findings connected with covering relations. Of course this simple representation might not always meet the requirements of real world applications. Therefore we will briefly present knowledge extensions of set-covering models. In [2] we showed how to apply these extensions in an incremental way.

2.2  Extension  by  Similarities  and  Weights 

Similarities  between  findings  and  weights  for  states  provide  signifi¬ 
cant  knowledge  extensions  for  set-covering  models.  In  the  following 
we  will  show  how  to  include  these  enhancements  into  the  quality 
measures  given  above. 

Similarities.  Consider  a  parameter  A  with  the  domain 

dom(A)  =  {no,  si,  mi,  hi}, 

with  the  meanings  normal  (no),  slightly  increased  (si),  medium  in¬ 
creased  (mi),  and  heavily  increased  (hi),  where  A  :  hi  is  predicted. 
We  clearly  see  that  the  observation  A :  mi  deserves  a  better  precision 
than  the  observation  A :  no.  Nevertheless  the  simple  quality  measure 
considers  both  observations  as  unexplained  findings  and  makes  no 
difference  between  the  similarities  of  the  parameter  values.  For  this 
reason  we  want  to  define  similarities  as  an  extension  to  set-covering 
models. 

We  define  the  similarity  function 

$$sim : \Omega_V \times \Omega_V \to [0, 1]$$

to  capture  the  similarity  between  two  values  assigned  to  the  same 
parameter.  The  value  0  means  no  similarity  and  the  value  1  indicates 
two  equal  values.  In  cluster  analysis  problems  this  function  is  also 
called  distance  function  (cf.  GO). 

With  similarities  we  need  to  adapt  Equation  0  for  computing  the 
precision  of  findings. 

$$\pi(A : v) = sim\big(Val_{\mathcal{H}}(A),\ Val_{\mathcal{F}_O}(A)\big),$$

where $Val$ returns the value of a specified attribute contained in a
specified set of states:

$$Val : 2^{\Omega_S} \times \Omega_A \to \Omega_V.$$

If no special similarity is included in the model, then we get the simple
quality measure by defining the default similarity $sim(v, v') = \delta_{v,v'}$,
where $\delta_{v,v'} = 1$ if $v = v'$, and $\delta_{v,v'} = 0$ otherwise.
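
The difference between the default similarity and a graded one can be sketched in Python. The ordinal similarity below (one minus the normalized rank distance) is only an assumed illustration for the domain {no, si, mi, hi}; the paper does not prescribe a particular similarity function, and the function names are hypothetical.

```python
# Ordinal domain of parameter A from the text:
# normal < slightly < medium < heavily increased.
DOMAIN = ["no", "si", "mi", "hi"]

def sim_default(v, w):
    """Default similarity: Kronecker delta on the two values."""
    return 1.0 if v == w else 0.0

def sim_ordinal(v, w):
    """Assumed graded similarity: 1 minus the normalized rank distance."""
    distance = abs(DOMAIN.index(v) - DOMAIN.index(w))
    return 1.0 - distance / (len(DOMAIN) - 1)

def finding_precision(predicted, observed, sim=sim_default):
    """Precision of a finding: similarity of predicted and observed value."""
    return sim(predicted, observed)
```

With the prediction A : hi, the default similarity rates the observations A : mi and A : no identically, whereas the graded similarity ranks A : mi above A : no.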

Weights. The introduction of weights for covered states is another
common generalization of the basic covering model. Here we apply
a weight function $w : \Omega_S \to \mathbb{N}^{+}$ to emphasize that some states
(findings and diagnoses) have a more significant pathological
importance than other states.

When  applying  weights  to  the  model  we  need  to  adapt  Equation  0 
which  calculates  the  precision  for  a  given  diagnosis: 


E  w(er)  ■  7r(er) 


7 r(D)  =  < 


E  ™(er) 

ren+ 


if  D+0^9, 


otherwise. 


Like  for  the  precision  of  a  diagnosis,  we  need  to  adapt  Equation  0 
to  calculate  the  quality  of  a  given  hypothesis: 


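$$\varrho(\mathcal{H}) = \frac{\sum_{r \in \mathcal{R}_{\mathcal{H},c}} w(e_r) \cdot \pi(e_r)}{\sum_{r \in \mathcal{R}^{+}_{\mathcal{H},O}} w(e_r) + \sum_{F \in \mathcal{F}^{isolated}_{\mathcal{H},O}} w(F)}$$

If all states have the same weight, i.e., $w(S) = 1$ for all $S \in \Omega_S$,
then the model reduces to the simple covering model.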

In addition to similarities and weights we have already introduced
uncertain covering relations and causal effect functions as possible
extensions (cf. [2]).


Then the weights of the AND-connected findings $F_i$ will only
contribute to the precision of $D$ if all of these findings are observed.
If not all findings are observed, then $D$ cannot explain the findings
and we have to check whether another diagnosis from the hypothesis can
explain these observations. All remaining findings that are still
unexplained will be added to the set of isolated findings
$\mathcal{F}^{isolated}_{\mathcal{H},O}$. This will decrease the quality measure for the current
hypothesis, since $\mathcal{R}_{\mathcal{H},c}$ will not contain relations covering the
unexplained observations. Given an AND-covering relation of the form

$$r = D \to_{and} \{ F_1, \ldots, F_n \}$$

we define for each $F_i \in \{ F_1, \ldots, F_n \}$:

$$\pi_r(F_i) = \begin{cases} \pi(F_i), & \text{if for all } F_j \in e_r : \pi(F_j) > 0 \\ 0, & \text{otherwise} \end{cases}$$

We try to explain all findings $F_i$ with $\pi_r(F_i) = 0$ but $\pi(F_i) > 0$ by
other diagnoses $D' \in \mathcal{H} \setminus \{D\}$. All remaining findings $F_i$ that
cannot be explained by other diagnoses are added to $\mathcal{F}^{isolated}_{\mathcal{H},O}$.

Example. Assume that we have the covering model of Figure 3,
where $c(D) = 0.5$, and we observe the set $\mathcal{F}_O$ =

Then $\pi(F_3) = 0$, since $F_3$ is not in $\mathcal{F}_O$. Therefore not all
precisions of the AND-covered findings are greater than 0, and we define
$\pi_r(F_2) = \pi_r(F_3) = 0$. We obtain $\mathcal{F}^{isolated}_{\mathcal{H},O} = \{F_2\}$ for hypothesis
$\mathcal{H} = \{D\}$. Notice that $F_3$ is not in $\mathcal{F}^{isolated}_{\mathcal{H},O}$, since it is not observed.
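
The local precision of AND-covered findings can be sketched as follows. This is a minimal illustration under the assumption that the global precisions $\pi(F_i)$ are given as a dictionary; the function name and the returned list of findings still awaiting an explanation are hypothetical helpers, not part of the original model description.

```python
def and_local_precision(precisions):
    """Local precision pi_r for the findings of one AND-covering relation.

    precisions maps each covered finding to its global precision pi(F).
    Returns (local, pending): local holds pi_r(F) for every finding,
    pending lists findings with pi_r(F) = 0 but pi(F) > 0, which must be
    explained by another diagnosis or become isolated findings.
    """
    all_positive = all(p > 0 for p in precisions.values())
    local = {f: (p if all_positive else 0.0) for f, p in precisions.items()}
    pending = [f for f, p in precisions.items() if p > 0 and local[f] == 0]
    return local, pending
```

For the example above, `and_local_precision({"F2": 1.0, "F3": 0.0})` sets both local precisions to 0 and reports only F2 as still unexplained, since F3 was never observed.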


3  Complex  Covering  Relations 

In the previous section we introduced the basic set-covering model
and extensions that allow for the refinement of set-covering
knowledge built with basic covering relations. In this section we propose
some further extensions of the representation: AND-, OR- and
[Min, Max]-relations.

To  keep  the  interpretation  of  covering  models  simple,  we  only  al¬ 
low  these  extensions  for  covering  relations  between  diagnoses  and 
(directly  observable)  findings. 

3.1  Conjunction  of  Covering  Relations 

It is desirable to be able to represent conjunctions between covering
relations. An AND-covering relation

$$D \to_{and} \{ F_1, \ldots, F_n \}$$

denotes that all covering relations $D \to F_i$ have to be fulfilled
simultaneously.


Figure 3. Covering relation $D \to_{and} \{ F_2, F_3 \}$


3.2  Disjunction  of  Covering  Relations 

We  also  can  express  alternative  covering  relations  with  disjunc¬ 
tion.  Here  we  can  distinguish  between  inclusive  (OR)  and  exclusive 
(Xor)  disjunctions. 

Figure 4. OR-/XOR-covering relations.

In Figure 4 we can see two different disjunctive covering relations
for diagnosis $D$: on the left the findings $F_2, F_3$ are connected
with the OR-covering relation $D \to_{or} \{F_2, F_3\}$, whereas on the
right the findings are connected with an XOR-covering relation
$D \to_{xor} \{F_2, F_3\}$. These OR/XOR-relations state that only one
of the connected findings has to be observed to fulfill the relation.
Of course we need to consider the different semantics in covering
models. When computing the quality measures we have to take the
following three cases into account:

1.  If  none  of  the  predicted  findings  is  observed,  then  nothing  has 
to  be  done.  The  covering  relations  connected  with  the  Or/Xor- 
condition  cannot  contribute  to  the  quality  measure  of  the  parent 
state. 


2.  If  one  of  the  predicted  findings  is  observed,  then  we  simply  cut 
all  other  states  connected  by  OR/XOR-relations  from  the  model. 
When  computing  the  quality  measure  we  only  take  the  observed 
finding  into  account. 

3. If more than one of the predicted findings are observed (e.g.
$\{F_2, F_3\} \subseteq \mathcal{F}_O$), then we have to differentiate between OR and
XOR relations. For both we take the finding with the maximal
contribution, e.g. regarding the weighted precision

$$\pi_w(F) = \pi(F) \cdot w(F).$$

For OR-relations we simply ignore the remaining observations for
assessing the quality. They will neither contribute to the quality of
the hypothesis nor will they need to be explained by other diagnoses.

For XOR-relations the observations left over still have to be
explained. As for the AND-relations, we try to explain them with
the other diagnoses contained in the current hypothesis. All
remaining findings that cannot be explained by other diagnoses are
added to the set of isolated findings $\mathcal{F}^{isolated}_{\mathcal{H},O}$.

OR/XOR-relations have to be used carefully because they interpret
observations differently. For example, multiple observations of one
XOR-covering relation are taken negatively into account (i.e., they
are assumed to be unexplained findings of the current hypothesis),
whereas in ordinary OR-relations they will not contribute in any way.

As shown for AND-covering relations, we also have to locally define
the precision for OR/XOR-covered findings in the context of the given
diagnosis. Consider an OR-relation (analogous for XOR):

$$r = D \to_{or} \{F_1, \ldots, F_n\}.$$

We select a finding $F_{max} \in \{F_1, \ldots, F_n\}$ such that
$\pi_w(F_{max}) = \max\{\pi_w(F_i) \mid 1 \leq i \leq n\}$. Then we say that

$$\pi_r(F_i) = \begin{cases} \pi(F_i), & \text{if } F_i = F_{max} \\ 0, & \text{otherwise.} \end{cases}$$

If there is more than one $F_i$ with maximum weighted precision
$\pi_w(F_i)$, then all but one (randomly selected) finding will be set to
the precision $\pi_r(F_i) = 0$.

When we compute the precision $\pi(D)$ of a diagnosis $D$, the
precisions of the findings $F_i$ that are covered by an OR/XOR-covering
relation contribute with the measure $\pi_r(F_i)$ and not with the usual
precision measure $\pi(F_i)$.

For XOR-relations we have to explain the remaining findings by other
diagnoses contained in the hypothesis or add them to $\mathcal{F}^{isolated}_{\mathcal{H},O}$.
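
The OR/XOR handling can be sketched as follows. This is a hedged illustration: the dictionaries `precisions` and `weights` and the function name are assumptions, "observed" is approximated by $\pi(F) > 0$, and ties are broken by taking the first maximal finding rather than a random one.

```python
def or_xor_local_precision(precisions, weights, exclusive=False):
    """Local precision pi_r for the findings of one OR- or XOR-relation.

    precisions: finding -> global precision pi(F)
    weights:    finding -> weight w(F)
    exclusive:  False for OR, True for XOR
    Returns (local, pending); pending lists left-over observations that
    an XOR-relation still needs to have explained elsewhere.
    """
    observed = [f for f, p in precisions.items() if p > 0]
    if not observed:
        # Case 1: nothing observed, the relation does not contribute.
        return {f: 0.0 for f in precisions}, []
    # Cases 2 and 3: keep only the finding with maximal weighted precision.
    best = max(observed, key=lambda f: precisions[f] * weights[f])
    local = {f: (precisions[f] if f == best else 0.0) for f in precisions}
    pending = [f for f in observed if f != best] if exclusive else []
    return local, pending
```

For two observed findings, the OR variant silently drops the weaker one, while the XOR variant reports it as still to be explained.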

3.3  Cardinalities  in  Covering  Relations 

Another enrichment of the set-covering representation is the
connection of covering relations by cardinality constraints. We express such
cardinalities by [Min, Max]-covering relations. Consider the example
in Figure 5. The covering relation between diagnosis $D$ and the
findings $F_1, F_2, F_3, F_4$ and $F_5$ means that between 2 and 4 of the
predicted findings have to be observed. We denote such relations by

$$r = D \to_{[2,4]} \{ F_1, F_2, F_3, F_4, F_5 \}.$$


Figure 5. A [Min, Max]-covering relation.


When we interpret [Min, Max]-relations $r = D \to_{[min,max]} \mathcal{F}$, we
have to consider three possible cases for the number
$k = |\mathcal{F} \cap \mathcal{F}_O|$ of relevant findings:

1. If $k \in [Min, Max]$, then all findings in $\mathcal{F} \cap \mathcal{F}_O$ will contribute.

2. If $k > Max$, then let $\mathcal{F}_{max} \subseteq \mathcal{F} \cap \mathcal{F}_O$ be the $Max$ findings
with the maximum weighted precisions among the findings in $\mathcal{F}$
(i.e. $|\mathcal{F}_{max}| = Max$). We explain the findings in $\mathcal{F}_{max}$ by $D$.
Then we try to explain the findings in $(\mathcal{F} \cap \mathcal{F}_O) \setminus \mathcal{F}_{max}$ by other
diagnoses also contained in the hypothesis. Those findings in
$(\mathcal{F} \cap \mathcal{F}_O) \setminus \mathcal{F}_{max}$ which we cannot explain by other diagnoses
$D' \in \mathcal{H} \setminus \{D\}$ are added to $\mathcal{F}^{isolated}_{\mathcal{H},O}$.

3. If $k < Min$, then we try to explain all findings in $\mathcal{F} \cap \mathcal{F}_O$ by other
diagnoses $D' \in \mathcal{H} \setminus \{D\}$. Findings which cannot be explained
by other diagnoses are added to $\mathcal{F}^{isolated}_{\mathcal{H},O}$.

We integrate [Min, Max]-relations into set-covering models by
locally defining the precision for findings connected by a [Min, Max]-relation
$r = D \to_{[min,max]} \mathcal{F}$. Then we say that for each $F \in \mathcal{F}$:

$$\pi_r(F) = \begin{cases} 0, & \text{if } k < Min \text{ or if } k > Max \wedge F \notin \mathcal{F}_{max} \\ \pi(F), & \text{if } k \in [Min, Max] \text{ or if } k > Max \wedge F \in \mathcal{F}_{max} \end{cases}$$

where $\mathcal{F}_{max}$ is again the set of the $Max$ findings with the best
weighted precisions among the findings in $\mathcal{F}$.

When calculating the quality measure for a diagnosis or hypothesis
we apply the precision $\pi_r(F)$ for all findings $F$ connected by the
relation $r$. Findings $F$ with $\pi_r(F) = 0$ but $\pi(F) > 0$ need to be
explained by other diagnoses contained in the hypothesis or will be
added to $\mathcal{F}^{isolated}_{\mathcal{H},O}$.
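
The three cases for [Min, Max]-relations can be sketched as follows. As before, this is only an assumed illustration: the function and parameter names are hypothetical, and a finding counts as observed when its global precision is positive.

```python
def min_max_local_precision(precisions, weights, lo, hi):
    """Local precision pi_r for the findings of one [Min, Max]-relation.

    precisions: finding -> global precision pi(F)
    weights:    finding -> weight w(F)
    lo, hi:     the bounds Min and Max of the relation
    Returns (local, pending); pending lists findings with pi_r(F) = 0 but
    pi(F) > 0 that other diagnoses of the hypothesis must explain.
    """
    observed = [f for f, p in precisions.items() if p > 0]
    k = len(observed)
    if k < lo:
        kept = set()                      # case 3: too few observations
    elif k <= hi:
        kept = set(observed)              # case 1: all observations count
    else:
        ranked = sorted(observed,
                        key=lambda f: precisions[f] * weights[f],
                        reverse=True)
        kept = set(ranked[:hi])           # case 2: keep the Max best ones
    local = {f: (precisions[f] if f in kept else 0.0) for f in precisions}
    pending = [f for f in observed if f not in kept]
    return local, pending
```

For a [2, 4]-relation over five findings, a single observation contributes nothing and is passed on, while five observations keep only the four with the highest weighted precision.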

It is worth mentioning that ordinary covering relations for a
diagnosis follow a similar concept, since we only consider predicted
findings that are also observed, but not all predicted findings of
the diagnosis. But as opposed to [Min, Max]-relations, all observed
predictions will contribute to the quality. In [Min, Max]-relations
only $Max$ observed findings will contribute; more than $Max$
findings have to be explained by other diagnoses. In general, an ordinary
covering model for a diagnosis $D$ with $n$ covered findings is
comparable to a $[c(D) \cdot n,\ n]$-relation connecting the $n$ findings.

3.4  Bounded  Covering  Relations 

The introduction of similarities for finding values is a useful
knowledge extension. Nevertheless, in some situations the expert wants to
express that a relation is only fulfilled if a covered parameter is
observed with exactly the predicted value, rather than a similar value.
Therefore we supplement necessary covering relations, disjunctive,
conjunctive and constrained covering relations with the optional
label bounded. We obtain the required behaviour by locally defining
the default similarity measure for bounded relations:

$$sim\big(Val_{\mathcal{H}}(A),\ Val_{\mathcal{F}_O}(A)\big) = \delta_{Val_{\mathcal{H}}(A),\ Val_{\mathcal{F}_O}(A)}.$$

I.e., 1 is assigned to the precision of a parameter $A$ only if it is
observed with the predicted value.

4  Constraints  for  Hypothesis  Generation 

As mentioned in the introduction of Section 2, the problem of
hypothesis generation is exponential, since for $n$ diagnoses we need to
consider about $2^n$ hypotheses in the worst case for an observation.
In the following we sketch some heuristics to restrict the
hypothesis space.

In a first step, we filter all diagnoses $D \in \Omega_D$ that are
relevant, i.e. that have the minimum precision. For this, we define the set
of relevant diagnoses

$$\Omega'_D = \{ D \in \Omega_D \mid \pi(D) \geq c(D) \}.$$

Then only diagnoses $D \in \Omega'_D$ will be taken into account when
generating hypotheses. Before describing concepts to shrink the set
of hypotheses, we define generators as a compact representation
for sets of hypotheses, which were introduced by Reggia et al.

m 

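Definition 4.1 (Generator) A generator $\mathcal{G}_I = \{G_1, \ldots, G_n\}$
consists of non-empty pairwise-disjoint subsets $G_i \subseteq \Omega'_D$. The set of
hypotheses $\mathcal{H}_{\mathcal{G}_I}$ generated by $\mathcal{G}_I$ is defined as

$$\mathcal{H}_{\mathcal{G}_I} = \{ H \subseteq \Omega'_D \mid |H \cap G_i| \leq 1 \text{ for all } 1 \leq i \leq n \}.$$

For $\mathcal{G}_I = \emptyset$, it holds that $\mathcal{H}_{\mathcal{G}_I} = \{\emptyset\}$. We can see that $\mathcal{H}_{\mathcal{G}_I}$ is
analogous to a cartesian set product.

For example, for the set-covering model defined in Figure 1 and
$\mathcal{F}_O = \{temp : inc,\ skin : sweat,\ nose : red\}$, we obtain $\mathcal{G} = \{\mathcal{G}_1, \mathcal{G}_2\}$
with $\mathcal{G}_1 = \{\{cold\}, \{fever\}\}$ and $\mathcal{G}_2 = \{\{flu\}\}$. So we
can compute $\mathcal{H}_{\mathcal{G}} = \{\emptyset, \{cold\}, \{fever\}, \{cold, fever\}, \{flu\}\}$ to be
the set of interesting hypotheses.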
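
The expansion of generators into hypotheses can be sketched in Python. The sketch assumes, in line with the example, that a hypothesis only contains diagnoses that occur in the subsets $G_i$ of its generator, and that a set of generators yields the union of the hypotheses of its members; the function names are hypothetical.

```python
from itertools import product

def hypotheses_of_generator(generator):
    """All hypotheses taking at most one diagnosis from each subset G_i."""
    # Each G_i offers either no diagnosis (None) or exactly one of its members.
    choices = [[None, *sorted(g)] for g in generator]
    return {
        frozenset(d for d in combination if d is not None)
        for combination in product(*choices)
    }

def hypotheses_of_generators(generators):
    """Union of the hypotheses generated by each generator in the set."""
    result = set()
    for generator in generators:
        result |= hypotheses_of_generator(generator)
    return result

g1 = [{"cold"}, {"fever"}]
g2 = [{"flu"}]
interesting = hypotheses_of_generators([g1, g2])
```

For the two generators of the example this yields the five hypotheses listed above, and an empty generator yields only the empty hypothesis, because `product()` without arguments produces exactly one empty combination.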

A method for computing and updating generator sets is extensively
described in 0. Generators are used to efficiently generate
hypotheses in an incremental manner: In a first step, sets of generators
describing higher level diagnoses (concepts) are created. For
hypotheses containing higher level diagnoses and having a high quality
measure, we build sets of generators containing underlying specialized
diagnoses and test them with their corresponding quality measure.
In the following, we introduce two basic knowledge extensions that
additionally shrink the space of generated hypotheses.

4.1  Exclusion  Constraints 

We  can  define  exclusion  constraints  to  filter  diagnoses  from  the  pro¬ 
cess  of  hypotheses  generation.  In  general,  two  kinds  of  constraints 
are  possible: 

$\neg(D \wedge F_1 \wedge \cdots \wedge F_n)$

If findings $F_1, \ldots, F_n$ are observed, then remove generated
hypotheses containing diagnosis $D$.


$\neg(D_1 \wedge \cdots \wedge D_m)$

Remove generated hypotheses containing all the diagnoses
$D_1, \ldots, D_m$ at the same time.

Thus,  we  create  hypotheses  using  generator  sets  and  check  each  gen¬ 
erated  hypothesis  against  the  available  exclusion  constraints.  If  one 
exclusion  constraint  evaluates  true,  the  hypothesis  is  discarded. 
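
Checking generated hypotheses against exclusion constraints can be sketched as follows. The encoding of a constraint as a pair (diagnoses, findings) is an assumption made for this illustration: the first kind of constraint has one diagnosis and a non-empty set of findings, the second kind has several diagnoses and no findings.

```python
def violates(hypothesis, observed, constraint):
    """True if the exclusion constraint evaluates to true for the hypothesis."""
    diagnoses, findings = constraint
    return diagnoses <= hypothesis and findings <= observed

def filter_hypotheses(hypotheses, observed, constraints):
    """Discard every hypothesis that triggers at least one constraint."""
    return [
        h for h in hypotheses
        if not any(violates(h, observed, c) for c in constraints)
    ]
```

A constraint of the second kind, for instance `({"cold", "fever"}, set())`, removes the hypothesis {cold, fever} but keeps {cold} and {fever}.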

It is worth noting that modifying the generator sets with
respect to exclusion constraints yields a combinatorial size of
generators and is therefore not reasonable. Evaluating the
generated hypotheses against the existing exclusion constraints has
proven to be more efficient.

4.2  Necessary  Covering  Relations 

A stronger type of covering relations are necessary covering
relations. A necessary covering relation between a diagnosis $D$ and a
finding $F_1$ means that $D$ necessarily covers $F_1$ and that $F_1$ always
has to be observed if $D$ is hypothesized. We depict a necessary
covering relation with $D \to_{nec} F_1$ as shown in Figure 6.


Figure 6. Necessary covering relation for a diagnosis D.

For applying necessary covering relations we introduce an adapted
definition of the precision $\pi_{nec}$ for each diagnosis $D \in \Omega_D$:

$$\pi_{nec}(D) = \begin{cases} 0, & \text{if } \exists\, r \in \Omega_R : r = D \to_{nec} F \text{ with } F \in \Omega_F \wedge \pi(F) < \tau \\ \pi(D), & \text{otherwise} \end{cases}$$

where $\tau \in [0, 1]$ is a specified threshold, which defines when a
finding is sufficiently observed (e.g. $\tau = 0.8$).

Therefore  a  diagnosis  D  does  not  propagate  any  contribution  to  its 
parent  states  until  all  necessarily  covered  findings  are  (sufficiently) 
observed.  Consequently,  D  will  not  appear  in  any  generator  and  thus 
will  not  be  included  in  any  hypothesis. 
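
The adapted precision for necessary covering relations can be sketched as a small guard around the ordinary precision. The function name and the list-based interface are assumptions made for illustration.

```python
def precision_nec(pi_d, necessary_precisions, tau=0.8):
    """Adapted precision of a diagnosis with necessary covering relations.

    pi_d:                 ordinary precision pi(D) of the diagnosis
    necessary_precisions: precisions pi(F) of all necessarily covered findings
    tau:                  threshold for "sufficiently observed"
    """
    if any(p < tau for p in necessary_precisions):
        return 0.0          # a necessary finding is not sufficiently observed
    return pi_d
```

A single necessary finding below the threshold suppresses the diagnosis entirely, whatever its ordinary precision is.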

5  Conclusions  and  Future  Work 

After  describing  the  basic  structures  of  set-covering  relations  we 
have  shown  how  to  enrich  the  model  with  additional  knowledge  like 
similarities  or  weights.  We  also  considered  the  computation  of  qual¬ 
ity  measures  of  these  parts.  Furthermore,  we  have  shown  represen¬ 
tational  extensions  to  the  set-covering  model  to  facilitate  necessary, 
disjunctive,  conjunctive  or  constrained  covering  relations.  An  impor¬ 
tant  characteristic  of  all  these  extensions  is  the  incrementality:  some 
enhancements  can  be  added  to  refine  special  aspects  of  the  model 
but  will  not  change  its  basic  semantics;  others  are  used  to  guide  the 
process  of  candidate  generation. 


In the future we are planning to work on the following fields:
Incremental development requires restructuring the model from time
to time. We are currently working on restructuring methods for
set-covering models that do not alter the basic semantics but improve the
design of the diagnosis knowledge. In software engineering,
refactoring [10, 11] has emerged as the corresponding method. In
general we have to look at validation techniques for set-covering
models besides simple case testing. Because of the special structure of
the model we also have to consider static verification techniques for
the set-covering representation. For a survey in this field we refer to

munHHini. 

In  this  paper  we  presented  a  hand-driven  development  of  set-covering 
models.  But  it  seems  to  be  possible  to  learn  coarse  models  automat¬ 
ically  from  a  small  number  of  available  cases.  Later  on  these  models 
should  be  refined  by  the  developer  with  additional  knowledge.  With 
such  a  semi-automatic  development  step,  the  initial  costs  of  knowl¬ 
edge  acquisition  can  be  reduced  conveniently.  Some  work  in  this  field 
has  been  done  by  Thompson  et  al.  01  and  Wang  et  al.  1771  .  This  step 
is not considered if we have a sufficiently large set of data, since then
traditional machine learning methods (e.g. learning neural networks,
learning Bayes networks) seem to be more appropriate.

Computing Minimal Conflicts
for  Rich  Constraint  Languages 


Jakob  Mauss1  and  Mugur  Tatar1 


Abstract. We address here the following question: Given an
inconsistent theory, find a minimal subset of it responsible for the
inconsistency. Such conflicts are essential for problem solvers that
make use of conflict-driven search (cf. [2, 4, 9]), for interactive
applications where explanations are required (cf. [16, 22]), or as
supporting tools for consistency maintenance in knowledge bases
(cf. [11]). Conflict computation in AI applications was usually
associated with dependency recording as performed by TMSs (cf.
[2, 3, 18]). These techniques, however, have a rather limited
applicability for languages that go beyond the expressive
power of propositional logic. For more powerful languages and
solvers, constraint suspension appeared, until now, to be the only
available alternative for the computation of minimal conflicts.

We present here an algorithm for computing minimal conflicts
that can be used with powerful constraint languages, e.g. possibly
including finite and non-finite variable domains, algebraic and FD
constraints, etc. The conflicts are extracted post mortem from the
proof (a tree with inferences of the form A ∧ B ⇒ C) that led to
the derivation of the inconsistency by an informed search that
computes and generalizes conflicting relations. The algorithm is
based on a simple but powerful principle that allows the
minimization problem to be decomposed recursively into smaller sub-problems.
This principle can also lay the foundation for efficient constraint
suspension algorithms that can be used in case no intermediary
results are cached during the constraint solving, i.e. in case no
proof structures are available.

1  INTRODUCTION 


For problems expressed using propositional logic or using
finite-domain (FD) constraints there exist some efficient solutions for the
computation of conflicts and explanations (cf. [13, 16, 18]).
Unfortunately, this is not the case for more expressive constraint
languages. Due to the scope of our application interests, namely
supporting engineering tasks such as safety and diagnosability
analysis and also design and configuration (cf. [12, 15, 20, 22]),
we are especially interested in modeling languages adequate for
engineering problems. Such languages have to mix logical and FD
constraints with (more or less) classical systems of linear and
non-linear algebraic or even differential equations. The general purpose
techniques that can be applied in this case for the (minimal)
conflict computation are constraint suspension (cf. [7]) and
TMS-like dependency recording (cf. [3]). Constraint suspension can
guarantee conflict minimality, but it is in many cases too
inconvenient due to the large amount of time required to
recompute many subsets of the initial problem. When applied to
systems of equations where local value propagation is not enough
for solving, TMS-based architectures usually become a heavy
machinery that consumes considerable amounts of time and
memory (see also [17]) and, in the end, still do not have any
guarantees for conflict minimality: the minimality is (at most)
guaranteed with respect to the propositional clauses that represent
the dependencies and not with respect to the semantics of the
original constraint language. The following example is an attempt
to illustrate this. Consider a system of five algebraic constraints

A1 ≡ x > 4      A3 ≡ y > 2      A5 ≡ x > 2y + 1

A2 ≡ x < 5      A4 ≡ y < 2

A solver may process these constraints in 4 steps as shown in
Figure 1. In step 4, they are discovered inconsistent. A minimal
conflict among the given constraints is {A2, A3, A5}. If the solver
were using dependency recording it would not find the above
minimal conflict, just the trivial {A1, A2, A3, A4, A5} in this case!

Figure 1. Tree for proving the inconsistency of 5 constraints.

Of course, this was just a simple example where no symbolic
variable elimination was required and, for the above example, one
can easily define a strategy to handle the conflict computation
correctly, for instance by maintaining separate dependencies
for lower and upper bounds of intervals as in [6]. However, this
unnecessarily overloads the solving process in case of consistency
and, still, would not solve the problem in general.

In  contrast,  the  key  idea  of  this  paper  is  to  do  a  (guided)  post 
mortem  analysis  of  the  context  in  order  to  compute  the  minimal 
conflicts.  The  algorithm  uses  the  information  that  A345  is 
conflicting  with  A12  (we  say  that  A345  is  a  conflicting  relation  for 
A12)  and  propagates  and  updates  these  conflicting  relations 
through  the  proof  tree  in  order  to  select  only  those  parts  of  it  that 
are  really  contributing  to  the  conflict.  The  paper  is  organized  as 
follows:  in  section  2  we  present  the  basic  procedure  for  extracting 
a  minimal  conflict  from  a  binary  proof  tree.  In  section  3  we 
describe  how  a  constraint  solver  can  control  the  inferences  in  order 
to  easily  provide  such  trees.  In  section  4  we  report  some  first 
empirical  results  regarding  the  performance  of  the  algorithms. 
Section  5  concludes  the  paper  with  a  comparison  to  related  work. 


2  COMPUTING  MINIMAL  CONFLICTS 


We assume in the following a relational framework, i.e. constraints
are noted as relations over variables with finite or continuous
domains. These relations may be represented extensionally (as in
Figure 4), or intensionally using formulas (as in Figure 1). In
relational terms, '∧' represents the join (intersection) of relations,
falsity is represented by the empty relation, and the implication
A ⇒ B is interpreted as subset relation A ⊆ B. A set of constraints
forms a conflict if it is not satisfiable, i.e. in relational terms, if the
join of the relations representing the constraints is the empty
relation. Given an initial set of inconsistent constraints, we are
interested in extracting a minimal conflict, i.e. a minimal subset
that is still inconsistent. Of course, there can be more than one
minimal conflict in an inconsistent context, but we focus for the
moment on finding just one such minimal conflict. In the
following, we show how to extract a minimal conflict from a
binary proof tree such as the one shown in Figure 2. The initial
constraints appearing as leaves in the proof tree are also called
assumptions in the following.


Figure 2. Tree proving the inconsistency of 11 assumptions.
(Legend: falsity, conclusion, assumption.)


Assume that we have two conflicting relations A, B, neither of them
being empty, i.e. A ≠ ⊥, B ≠ ⊥, and A ∧ B ⇒ ⊥. Then we have to
consider two cases.


1. A and B are both assumptions. In this case, {A, B} is the
minimal conflict.


2. At least one of A, B is not an assumption. Assume without loss
of generality that A has been derived from A1 and A2, i.e.
A1 ∧ A2 ⇒ A. Let now C1 := A1 ∧ B and C2 := A2 ∧ B. We can
then distinguish 4 cases, as shown in Figure 3.


Figure  3.  Four  cases  distinguished  by  computing  intersections. 
Each  relation  is  depicted  as  a  set  of  variable  assignments. 


2.1: C1 = ⊥ ∧ C2 ≠ ⊥. In this case, the assumptions leading to the
derivation of A2 do not contribute to the conflict with B.
Consequently, we can prune the whole sub-tree A2 and
continue the conflict search in A1.

2.2: C1 ≠ ⊥ ∧ C2 = ⊥. Analogous to case 2.1; A1 can be ignored.

2.3: C1 = ⊥ ∧ C2 = ⊥. There are at least two independent
conflicts with B, at least one in the sub-tree A1 and at least one
in A2. If we want to find just one conflict then we can
non-deterministically decide to skip one of the sub-trees.

2.4: C1 ≠ ⊥ ∧ C2 ≠ ⊥. All minimal conflicts are spread across
both sub-trees. A minimal conflict has to be composed from a
partial solution retrieved from the sub-tree A1 and an
appropriate completion retrieved from the sub-tree A2. If B was
a conflicting relation for A, then C1 is a conflicting relation for
A2 and C2 is a conflicting relation for A1. With these new
conflicting relations we can descend recursively in the A1, A2
sub-trees and collect the sub-conflicts.


This  case  analysis  leads  to  the  following  procedure  for  extracting  a 
minimal  conflict  from  a  proof  tree. 

Specification:  Let 

R be a non-empty set of relations (assumptions),

A the root of a binary proof tree with the leaves given by R,

B a conflicting relation for A, i.e. B ≠ ⊥ and A ∧ B = ⊥.

The proof tree satisfies the requirement that, for any non-leaf node
A: left(A) ∧ right(A) ⇒ A.

The procedure XC1(A, B) returns one minimal and non-empty
set M ⊆ R such that (∧M) ∧ B = ⊥. As a consequence, if A is the
root of a refutation tree then XC1(A, ⊤) returns a minimal conflict
from the tree, where ⊤ represents the universal relation, i.e. the
complement of ⊥.

XC1(A, B)

  ①  if (isLeaf(A)) return { A }

     A1 ← left(A)
     A2 ← right(A)
     C1 ← A1 ∧ B
     C2 ← A2 ∧ B

  ②  if (C1 = ⊥ and C2 ≠ ⊥) return XC1(A1, B)

  ③  if (C1 ≠ ⊥ and C2 = ⊥) return XC1(A2, B)

  ④  if (C1 = ⊥ and C2 = ⊥) return XC1(A1, B)
                             or return XC1(A2, B)

  ⑤  if (C1 ≠ ⊥ and C2 ≠ ⊥)
         M1 ← XC1(A1, C2)
         M2 ← XC1(A2, (∧M1) ∧ B)
         return M1 ∪ M2

In case ⑤ the procedure first descends in the sub-tree A1 with C2
as conflicting relation. Before it descends in the sub-tree A2 we,
however, have to generalize A1 to ∧M1 and C1 to (∧M1) ∧ B. This
is necessary in case we have several minimal conflicts that span
over the sub-trees A1 and A2, in order to select from A2 a
sub-conflict that is part of the same conflict as the sub-conflict that was
non-deterministically chosen (case ④) from the sub-tree A1. Such a
case is also illustrated by the following example.

Example. Consider the set R = {A1, ..., A5} shown in Figure 4.
The constraints are extensionally defined relations in this example,
e.g. A1 = '(x = a ∧ y = 1) ∨ (x = b ∧ y = 0)'. R is inconsistent;
actually it contains two minimal conflicts. Figure 5 shows how
XC1 computes one of them. Circled numbers correspond to the
five cases marked in the pseudo code above.


Figure 4. A proof tree for R = {A1, A2, A3, A4, A5}

The crucial part of the procedure is handled in case ⑤, where a
minimal conflict is composed as a disjoint union of two sets M1
and M2 computed using the left and right sub-tree. Note that the
second set M2 depends on the first set M1. During the recursive
call at A6 the procedure non-deterministically decides to select the
conflict containing A1. This decision is reflected in the arguments
of the succeeding call at A7 in order to select the right sub-conflict,
i.e. A3 and not A4, which could be erroneously selected if we did
not update the conflicting relation for A7!
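
The XC1 procedure can be sketched in Python under simplifying assumptions: relations are finite sets of (x, y) assignments over a small integer grid, so the join is plain set intersection and the empty check is complete. The names `Node`, `leaf`, `join` and `xc1`, and the five constraints with their non-strict bounds, are chosen for illustration only; case ④ is always resolved by descending into the left sub-tree.

```python
from dataclasses import dataclass
from functools import reduce
from itertools import product
from typing import FrozenSet, Optional, Set, Tuple

Relation = FrozenSet[Tuple[int, int]]            # set of (x, y) assignments
UNIVERSE: Relation = frozenset(product(range(11), repeat=2))

@dataclass(frozen=True)
class Node:
    rel: Relation
    name: Optional[str] = None                   # only leaves carry a name
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def leaf(name, predicate) -> Node:
    """Assumption: the relation of all assignments satisfying predicate."""
    return Node(frozenset(p for p in UNIVERSE if predicate(*p)), name=name)

def join(left: Node, right: Node) -> Node:
    """Inner proof-tree node: the join (intersection) of both children."""
    return Node(left.rel & right.rel, left=left, right=right)

def xc1(a: Node, b: Relation) -> Set[Node]:
    """Return leaves M of tree a with (join of M) & b empty, M minimal.

    Precondition: b is non-empty and a.rel & b is empty.
    """
    if a.left is None:                           # case 1: leaf
        return {a}
    a1, a2 = a.left, a.right
    c1, c2 = a1.rel & b, a2.rel & b
    if not c1:                                   # cases 2 and 4: go left
        return xc1(a1, b)
    if not c2:                                   # case 3: go right
        return xc1(a2, b)
    m1 = xc1(a1, c2)                             # case 5: conflict spans both
    m1_and_b = reduce(lambda acc, n: acc & n.rel, m1, b)
    return m1 | xc1(a2, m1_and_b)

a1 = leaf("A1", lambda x, y: x >= 4)
a2 = leaf("A2", lambda x, y: x < 5)
a3 = leaf("A3", lambda x, y: y >= 2)
a4 = leaf("A4", lambda x, y: y <= 2)
a5 = leaf("A5", lambda x, y: x >= 2 * y + 1)
root = join(join(a1, a2), join(join(a3, a4), a5))
conflict = {n.name for n in xc1(root, UNIVERSE)}
```

Here the root relation is empty, and the call with the universal relation returns the leaves A2, A3 and A5: x < 5, y ≥ 2 and x ≥ 2y + 1 cannot hold together, while any two of them can.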


Some  properties  of  XC1  that  are  worth  discussing  are: 

(1) During top-down traversal of the proof tree, only direct fathers
of the nodes contained in the returned minimal conflict are visited.
Sub-trees not involved in the minimal conflict are pruned without
investigating their nodes. The worst case appears when the pruning
is not effective and we have to inspect the whole tree (always in
case ⑤). For a tree with n leaves there are no more than 4(n-1)
joins in the worst case (see also the incremental computation of
∧M1 later on). However, the complexity of the conflict
minimization crucially depends on the complexity of the basic join
operations.


Figure  5.  Trace  of  computation  of  a  minimal  conflict 

(2) An inference engine will in general be unable to provide
complete implementations of the join and empty-check operations
- for instance, when we are dealing with systems of non-linear
equations. When used in conjunction with a correct but incomplete
inference engine, XC1 may return a non-minimal conflict. The
'minimality' of a conflict is only relative to the degree of
completeness of the inference services supplied by the solver.


Figure 6. Two minimal conflicts {B, D} and {A, C, D}

(3) The procedure can easily be extended to return several minimal
conflicts instead of only one. Basically, in case ©, one can
continue the search in both sub-trees instead of non-deterministically
choosing one of them. However, this simple extension of XC1 will
not always return all of the minimal conflicts. See Figure 6 for an
example: the second conflict {A, C, D} is missed when using the
given proof tree. In any case, computing all minimal conflicts
of a context can require significantly more effort and is seldom
justified in practice.


(4) There are several obvious improvements to the efficiency of
XC1 as given above. If the proof structure is a tree then M1 ∪ M2
can be computed as a disjoint union in case ©. If case ® is
always mapped to (say) case © then the computation of C2 is
required only if C1 ≠ ⊥. The repeated ⋀ computations ⋀M1 can be
avoided if XC1, in addition to returning the set M, also returns the
join ⋀M, which allows for an incremental computation of the
conjunction in case ©. Moreover, the generalization of the
conflicting relation C1, i.e. (⋀M1) ∧ B in case ©, is required only
if it is a strict generalization, i.e. if M1 is a strict subset of the
leaves of A1.

3  DERIVING  PROOF  TREES 

In  the  previous  section  we  have  seen  how  to  extract  one  or  several 
minimal  conflicts  from  a  proof  structure.  In  this  section  we  sketch 
how  a  constraint  solver  operating  on  a  set  R  of  input  relations  can 
control  the  inference  in  order  to 

1. check whether R is consistent, i.e. whether ⋀R ≠ ⊥

2.  solve  R  for  any  variable 

3. provide the proof structure required for conflict computation.

Conflict computation using XC1 works, however, with any well-formed
proof structure, irrespective of whether the proof was generated by a
solver like the one described in this section or not².

We denote by V(A) the set of variables constrained by a relation
A, and by π(A, X) the projection of A onto a variable set X. The
projection π(A, X) results from eliminating all variables V(A) \ X
from the relation A. For example, if

A = x² + y² < 1    B = (x - 1)² + y² < 1

then π(A ∧ B, {x}) = '0 < x ∧ x < 1'

and π(A ∧ B, {y}) = '-√3/2 < y ∧ y < √3/2'.

The projection operation is an abstraction (generalization)
operation, i.e. A ⇒ π(A, X). Hence, the computation
C := π(A ∧ B, X) can be seen as an inference of the form
A ∧ B ⇒ C. We call such an inference, i.e. projecting the join of
two relations A and B onto a variable set X, an aggregation.
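As a minimal sketch of an aggregation, assume that relations are stored extensionally as a tuple of variable names plus a set of value tuples. The names Relation, join, project and aggregate, as well as the second relation A2, are illustrative assumptions and are not taken from the paper; only A1 follows the example relation given in section 2.

```python
class Relation:
    """Extensional relation: variable names plus the set of allowed rows."""
    def __init__(self, variables, rows):
        self.variables = tuple(variables)
        self.rows = {tuple(r) for r in rows}

def join(a, b):
    """Natural join: keep row pairs that agree on all shared variables."""
    extra = tuple(v for v in b.variables if v not in a.variables)
    rows = set()
    for ra in a.rows:
        da = dict(zip(a.variables, ra))
        for rb in b.rows:
            db = dict(zip(b.variables, rb))
            if all(da[v] == db[v] for v in b.variables if v in da):
                rows.add(ra + tuple(db[v] for v in extra))
    return Relation(a.variables + extra, rows)

def project(a, keep):
    """Eliminate every variable of `a` that is not in `keep`."""
    idx = [i for i, v in enumerate(a.variables) if v in keep]
    return Relation([a.variables[i] for i in idx],
                    {tuple(r[i] for i in idx) for r in a.rows})

def aggregate(a, b, keep):
    """Aggregation: project the join of a and b onto `keep`."""
    return project(join(a, b), keep)

# A1 = (x = a and y = 1) or (x = b and y = 0); A2 is a hypothetical relation.
A1 = Relation(("x", "y"), {("a", 1), ("b", 0)})
A2 = Relation(("y", "z"), {(1, "p"), (1, "q")})
C = aggregate(A1, A2, {"x", "z"})
```

In this representation the empty row set plays the role of ⊥, and projecting a non-empty relation onto the empty variable set yields the single empty row, which plays the role of ⊤.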

The  computed  proofs  will  contain  aggregations  as  the  only  kind 
of  inference.  The  proof  structures  will  be  used  to  derive  minimal 
conflicts,  or  minimal  explanations  of  variable  solutions. 

The consistency check may seem trivial to specify. We could
simply ask the solver to compute ⋀R and check whether ⋀R ≠ ⊥.
However, in the practical applications with which we are
commonly confronted, R may contain hundreds of algebraic and
logical constraints with thousands of variables. In this case, the
intermediate relations created during the computation of ⋀R would
be huge. Instead, following [1], after computing a single
conjunction A ∧ B, we eliminate from the result all those variables
that do not occur in the remaining relations. Consequently, the
intermediate relations remain 'small' - the size depending, of
course, on the degree of connectivity of the constraint network.
This works well as long as a variable is shared by a relatively small
number of constraints. If the connectivity degree increases (cf.
induced width w* in [5]), then many of the aggregations degrade
to simple joins and the approach is likely to become inappropriate.


² Such proof structures can also be recovered, for instance, from the
well-founded support recorded by a TMS (cf. [18]) - in which case XC1
could be used for further conflict minimization (recall that a TMS
guarantees minimality only with respect to propositional dependencies and
not with respect to the more expressive constraint language).


The  creation  of  a  proof  tree  for  the  consistency  check  is  given  by 
the  following  procedure. 

Specification: Let R be a non-empty set of assumptions, ⊥ ∉ R.
Then the procedure isConsistent(R) returns true iff R is satisfiable.

isConsistent(R)
    if (|R| = 1)
        return true
    else
        choose {A, B} ⊆ R
        S ← R \ {A, B}
        X ← (vars(A) ∪ vars(B)) ∩ (⋃ vars(S))
        C ← π(A ∧ B, X)
        if (C = ⊥) return false
        return isConsistent(S ∪ {C})

vars(A)
    if (A is a leaf)
        return V(A)
    else
        return vars(left(A)) ∪ vars(right(A))

Obviously, the procedure isConsistent computes a proof tree
containing aggregations as the only kind of inference. Therefore,
we call this an aggregation tree. If A is the root of an aggregation
tree for an inconsistent set R of assumptions then, as shown in
section 2, XC1(A, ⊤) returns a minimal conflict. To keep the
conflicting relation B small, we may add a projection step
B ← π(B, vars(A)) as the first instruction in XC1. The strategy used
to choose a pair of relations for aggregation may, for example,
minimize the variable set X or try to achieve a balanced tree.
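The procedure isConsistent can be sketched as follows, assuming finite extensional relations encoded as pairs (variable tuple, set of rows). The Node class, the naive choice of the pair to aggregate, and the use of the variables of the current relations when computing X are assumptions of this sketch rather than details taken from the paper.

```python
def join(a, b):
    """Natural join of two relations given as (variables, rows)."""
    (va, ra), (vb, rb) = a, b
    extra = tuple(v for v in vb if v not in va)
    rows = set()
    for x in ra:
        dx = dict(zip(va, x))
        for y in rb:
            dy = dict(zip(vb, y))
            if all(dx[v] == dy[v] for v in vb if v in dx):
                rows.add(x + tuple(dy[v] for v in extra))
    return (va + extra, rows)

def project(a, keep):
    """Eliminate all variables of `a` that are not in `keep`."""
    va, ra = a
    idx = [i for i, v in enumerate(va) if v in keep]
    return (tuple(va[i] for i in idx), {tuple(r[i] for i in idx) for r in ra})

class Node:
    """Node of the aggregation tree; leaves have no children."""
    def __init__(self, rel, left=None, right=None):
        self.rel, self.left, self.right = rel, left, right

def is_consistent(relations):
    """Return (consistent?, root of the aggregation tree built so far)."""
    nodes = [Node(r) for r in relations]
    while len(nodes) > 1:
        a, b = nodes.pop(), nodes.pop()            # naive choice of {A, B}
        remaining = {v for n in nodes for v in n.rel[0]}
        keep = (set(a.rel[0]) | set(b.rel[0])) & remaining
        c = Node(project(join(a.rel, b.rel), keep), a, b)
        if not c.rel[1]:                           # C = ⊥: refutation found
            return False, c
        nodes.append(c)
    return True, nodes[0]

A1 = (("x", "y"), {("a", 1), ("b", 0)})
A2 = (("y",), {(0,)})
A3 = (("x",), {("a",)})
ok, root = is_consistent([A1, A2, A3])
```

For the three relations above the check fails, and the returned root is the refutation tree whose leaves form the initial conflict. A different pair-selection strategy only changes the shape of the tree, not the verdict.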

For checking the consistency of n assumptions, isConsistent
computes n - 1 aggregations. A significant feature of proof trees as
derived above is their ability to support incremental context
analysis. Assume we have performed a consistency check for a set
R of n assumptions, and we want to analyze a second context R',
constructed by replacing an assumption A in the proof tree for R
by a new assumption B with the same variable set. In order to
check the new context R ∪ {B} \ {A}, we only have to re-compute
the inferences on the path from A to the root of the proof tree, i.e.
if the proof tree is balanced, we only have to compute log(n)
aggregations. As we see next, the computation of variable
solutions can be performed using aggregations as well.

Specification: Let R be a non-empty consistent set of
assumptions, and A the root of an aggregation tree computed by
the procedure isConsistent. Then the procedure solve(A, ⊤)
computes for every variable x in R the solution S[x] := π(⋀R, {x}).

solve(A, B)
    if (A is a leaf) return
    B ← π(B, vars(A))
    A1 ← left(A); A2 ← right(A)
    A12 ← A1 ∧ A2
    X ← vars(A1) ∪ vars(A2) \ vars(A)
    for each x ∈ X
        S[x] ← π(A12 ∧ B, {x})
    solve(A1, A12 ∧ B)
    solve(A2, A12 ∧ B)

If we take a closer look at the procedure solve, we note that each
S[x] is the root of a proof structure defined by a sequence of
aggregations. In this case the proof is not purely a tree; it is
actually a DAG because some nodes are used more than once.
Still, the proof is well-formed, i.e. there are no cyclic justifications.
XC1 can be modified to cope with the DAG structure. The
resulting procedure XE1(S[x], ¬S[x]) returns a minimal
supporting set of assumptions for the solution of x, i.e. a minimal
subset E ⊆ R such that S[x] = π(⋀E, {x}).


4  APPLICATION  AND  EMPIRICAL  RESULTS 

We  have  recently  finished  a  prototype  implementation  of  a 
Relational  Constraint  Solver  (RCS)  that  follows  the  principles 
described  in  this  paper,  including  the  computation  of  explanations 
and  conflicts.  RCS  is  already  integrated  in  our  environment  for 
engineering  knowledge  management,  and  its  integration  in  MDS 
[12]  is  planned  to  follow. 

In  this  section  we  compare  XC1  with  the  conflict  computation 
based  on  naive  constraint  suspension.  Let  R  be  an  inconsistent  set 
of  relations.  Then  the  procedure  MC(R,  { })  returns  a  minimal 
conflict,  computed  by  constraint  suspension. 

MC(R, M)
    if R = { } return M
    else choose A ∈ R
        if ⋀(R ∪ M \ {A}) = ⊥
            return MC(R \ {A}, M)
        else return MC(R \ {A}, M ∪ {A})

The procedure MC resembles Junker's RobustXplain [8], which
may use a trailing mechanism not described in [8] to perform
incremental (i.e. fast) consistency checking. If |R| = n and a
consistency check for R requires n aggregations, then MC(R, { })
needs O(n²) aggregations for computing a minimal conflict. In
contrast, XC1 requires only O(n) aggregations for the same task,
given an arbitrary, not necessarily balanced tree. Our
implementation of MC uses an incremental consistency check as
explained in section 3 - thus, a check requires only O(log(n))
instead of O(n) aggregations in the best case.
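The suspension procedure MC can be sketched in a few lines. The sketch assumes that constraints are named predicates over finite domains and that consistency is decided by brute-force enumeration; both are simplifications made for illustration (the paper's implementation uses aggregations instead), and the constraints A, B and C are hypothetical.

```python
from itertools import product

def consistent(constraints, domains):
    """Brute-force satisfiability check over finite domains."""
    names = sorted(domains)
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        if all(pred(assignment) for _, pred in constraints):
            return True
    return False

def mc(r, m, domains):
    """Reduce the inconsistent set r + m to a minimal conflict by suspension."""
    if not r:
        return m
    a, rest = r[0], r[1:]
    if not consistent(rest + m, domains):    # still inconsistent without a
        return mc(rest, m, domains)          # a can be dropped
    return mc(rest, m + [a], domains)        # a is necessary for the conflict

DOMAINS = {"x": [0, 1], "y": [0, 1]}
R = [
    ("A", lambda s: s["x"] == 1),
    ("B", lambda s: s["x"] == 0),
    ("C", lambda s: s["y"] == 1),
]
conflict = [name for name, _ in mc(R, [], DOMAINS)]
```

Each of the n constraints triggers one full consistency check, which is where the quadratic number of aggregations comes from when a single check costs n aggregations.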

Figure 7. An 8-bit full adder (eight 1-bit full adders VA0 ... VA7
chained through the carries c0 ... c8, with inputs xk, yk and sum
outputs sk)

For the empirical comparison, we used a set R of 137 relations,
representing eight 1-bit full adders connected in series as shown in
Figure 7, and the assignments c0 = 1, and for 0 ≤ k ≤ 7: xk = 0,
yk = 1. If we add one more relation of the form ck = 0, then R
becomes inconsistent and contains a minimal conflict M of size
|M| = 2 + 3k. This gives us 8 different sets Rk of size 138,
each containing a minimal conflict M of size 2 + 3k.
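To see why every added relation ck = 0 makes the set inconsistent, the following sketch simulates the carry chain at the functional level. It assumes the standard Boolean definition of a 1-bit full adder; the decomposition of the circuit into the 137 relations used in the experiment is not reproduced here.

```python
def full_adder(x, y, c_in):
    """Standard 1-bit full adder: returns (sum bit, carry out)."""
    s = x ^ y ^ c_in
    c_out = (x & y) | (c_in & (x ^ y))
    return s, c_out

def ripple(xs, ys, c0):
    """Chain the adders in series; carries[k] is the carry entering adder k."""
    carries, sums = [c0], []
    for x, y in zip(xs, ys):
        s, c = full_adder(x, y, carries[-1])
        sums.append(s)
        carries.append(c)
    return sums, carries

# Assignments from the experiment: c0 = 1, xk = 0, yk = 1 for all eight adders.
sums, carries = ripple([0] * 8, [1] * 8, 1)
```

Under these assignments every carry equals 1, so asserting ck = 0 contradicts the value propagated through the first k adders; the further down the chain the assertion is placed, the more relations take part in the conflict.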


Figure  8.  Empirical  results 

For each k, isConsistent(Rk) is run as the consistency check and
returns a refutation tree that is used as input by both MC and XC1.
The leaves of this tree represent an initial (not necessarily minimal)
conflict of size n - see Figure 8. The table gives the average results
obtained by running both algorithms 100 times for all eight Rk.
For each run, we permuted the order of the input relations, which
resulted in different structures of the derived aggregation trees. The
columns in the table denote the average number of join and project
operations needed for conflict detection in isConsistent and by the
subsequent minimization call to MC or to XC1. The last column
gives the ratio of the measured runtimes for MC and XC1. For
example, for the case of a minimal conflict of size 2, the average
initial conflict provided by isConsistent has size 8.5 and it takes
59.1 aggregations (join followed by project) to detect the conflict.
MC then needs 95.7 more joins and 91.6 projections to minimize
the initial conflict by suspension, while XC1 is 446 times faster
than MC and needs only 8.8 joins and 3.5 projections for the same
task. The performance gain of XC1 relative to MC depends
strongly on the structure of the proof trees supplied by the solver -
i.e. whether the conflicting assumptions are uniformly spread
among the leaves of the tree or whether they are clustered in a few
sub-trees.

5  RELATED  WORK  AND  DISCUSSION 

Dependency recording, like the one performed by TMSs (cf. [2, 3,
18]), works reasonably well as long as we stay in a propositional
framework (or, anyway, in a finite world). In more expressive
frameworks these techniques gradually become both

• very resource consuming (in time and space)

• incomplete with respect to the more expressive framework.

Constraint suspension is another technique used for conflict
computation. It is in general expensive because it relies on
performing the consistency check many times for different subsets
of the initial problem. A recent enhancement to constraint
suspension is the one reported in [8]. The performance of the
conflict computation is improved there in two ways:

(a)  by  adding  the  constraints  to  the  solver’s  store  one  after  the 
other  and  performing  each  time  a  complete  consistency  check, 
one  knows  that  the  last  constraint  added  that  caused  the  store 
to  become  inconsistent  is  part  of  all  conflicts  from  the  already 
considered  subset;  and 

(b)  by  employing  an  intelligent  search,  where  sets  of  constraints 
are  simultaneously  suspended  and  then  are  binary  split  if 
necessary. 

The proof structure corresponding to the control strategy assumed
by Junker is a linear tree. We do not need to enforce the sequential
consistency check as assumed by (a). We can assume any
clustering technique, such as the ones resulting from structure
analysis, e.g. cycle-cutset, hypertree decomposition, etc. (cf. [10]),
and thus take advantage of the performance improvements for
constraint solving enabled by these methods. Our solution is better
suited to solvers employing such decompositions or to solvers
that record (at least partially) their proof structures in order
to support incremental operation. Although developed
independently and using quite different notations, the principle
underlying the decomposition of the conflict minimization
problems is the same for our XC1 and for Junker's QuickXplain
algorithm. After several years of trying to improve constraint
solving and dependency recording (cf. [15]), the existence of such
simple and general algorithms for minimal conflict computation
came to us as a surprising, positive result.

One of our main application areas is model-based diagnosis.
We do not argue here that one should perform diagnosis by always
first computing conflicts and then generating minimal / preferred /
etc. diagnoses. Several authors point out that the direct
computation of diagnoses can be more efficient (cf. [16, 19, 21,
23]). The ideas from an algorithm such as TREE* (cf. [21]) can
probably be adapted easily to a general relational framework such as
the one of RCS. One weak point, however, of the available
computation techniques that are not based on conflicts is that they
basically address static problems. It would be interesting to see if
the ideas of the temporal decomposition that can be applied for
computing minimal conflicts (cf. [14]) can also be applied to the
direct computation of diagnoses or interpretations.

Although we have discussed the computation of minimal
conflicts here, in practice minimality and completeness have to be
traded against efficiency. Nevertheless, sometimes the definition of
the application (minimisation, compilation, explanation, etc.)
requires a higher degree of completeness, which is then more
important than the computation times.

REFERENCES 

[1]  Y.  El  Fattah:  An  Elimination  Algorithm  for  Model-based  Diagnosis. 
Dx98,  Cape  Cod,  USA,  pp.  47-54,  1998. 

[2]  K.  Forbus,  J.  de  Kleer:  Building  Problem  Solvers.  MIT  Press,  1993. 

[3]  J.  de  Kleer:  An  Assumption-based  truth  maintenance  system. 
Artificial  Intelligence,  28,  pp.  127-162,  1986. 

[4] J. de Kleer, B. Williams: Diagnosing Multiple Faults. Artificial
Intelligence, 32, pp. 97-130, 1987.

[5] R. Dechter: Bucket Elimination: a Unifying Framework for
Reasoning. Artificial Intelligence, 113, pp. 41-85, 1999.

[6] D. J. Goldstone: Controlling inequality reasoning in a TMS-based
analog diagnosis system. 9th Nat. Conf. on AI, pp. 512-517, 1991.

[7]  R.  Bakker,  F.  Dikker,  F.  Tempelman,  P.  Wognum:  Diagnosing  and 
solving  over-determined  CSP.  Proc.  IJCAI-93,  1993 

[8] U. Junker: QUICKXPLAIN: Conflict Detection for Arbitrary
Constraint Propagation Algorithms. IJCAI'01 Workshop on
Modelling and Solving Problems with Constraints, pp. 75-82, 2001.

[9]  N.  Muscettola,  P.  Nayak,  B.  Pell,  B.  Williams:  Remote  Agent:  To 
boldly  go  where  no  AI  system  has  gone  before.  Artificial  Int.,  103, 
pp.  5-47,  1998. 

[10] G. Gottlob, N. Leone, F. Scarcello: A comparison of structural CSP
decomposition methods. Artificial Int., 124(2), pp. 243-282, 2000.

[11] A. Felfernig, G. Friedrich, D. Jannach, M. Stumptner:
Consistency-based Diagnosis of Configuration Knowledge Bases.
ECAI-2000, Berlin, 2000.

[12]  J.  Mauss,  V.  May,  M.  Tatar:  Towards  Model-based  Engineering: 
Failure  Analysis  with  MDS.  ECAI-2000  Workshop  W3I.  2000. 
http://www.dbai.tuwien.ac.at/event/ecai2000-kbsmbe/papers.html 

[13] F. Bouquet, P. Jegou: Solving over-constrained CSP using weighted
OBDDs. Proc. Over-Constrained Systems, Lecture Notes in
Computer Science, Vol. 1106, Springer, Berlin, 1996.

[14] M. Tatar: Diagnosis with cascading defects. ECAI-1996, 1996.

[15] M. Tatar: Model-based failure analysis in engineering - an
experience report. Invited talk at Dx 2001. Available on request.

[16]  J.  Amilhastre,  H.  Fargier,  P.  Marquis:  Consistency  restoration  and 
explanations  in  dynamic  CSPs.  Artificial  Intelligence  135,  2002. 

[17]  G.  Katsillis,  M.  Chantler:  Can  Dependency-based  Diagnosis  Cope 
with  Simultaneous  Equations?  Dx97,  France,  1997. 

[18] D. McAllester: Truth Maintenance. AAAI-90, pp. 1109-1116, 1990.

[19] W. Nejdl, B. Giefer: DRUM: Reasoning without conflicts and
justifications. Dx94, pp. 226-233, New Paltz, NY, 1994.

[20] M. Tatar, P. Dannenmann: Integrating Simulation and model-based
Diagnosis into the Life Cycle of Aerospace Systems. Dx99, Loch
Awe, Scotland, 1999.

[21] M. Stumptner, F. Wotawa: Diagnosing tree-structured systems.
Artificial Intelligence, 127, pp. 1-29, 2001.

[22]  F.  Feldkamp,  M.  Heinrich,  K.-D.  Meyer-Gramann:  SyDeR  -  System 
Design  for  Reusability.  AI-EDAM  Special  Issue  on  Configuration 
Design.  Sept.  1998. 

[23]  A.  Darwiche:  Decomposable  Negation  Normal  Form.  Journal  of 
ACM,  July  2001. 


A  Model  Counting  Characterization  of  Diagnoses 


T.  K.  Satish  Kumar 

Knowledge  Systems  Laboratory 
Stanford  University 
tksk@ksl.stanford.edu 


Abstract 

Given the description of a physical system in one
of several forms (a set of constraints, a Bayesian network,
etc.) and a set of observations made, the
task of model-based diagnosis is to find a suitable
assignment to the modes of behavior of individual
components (this notion can also be extended
to handle transitions and dynamic systems [Kurien
and Nayak, 2000]). Many formalisms have been
proposed in the past to characterize diagnoses and
systems. These include consistency-based diagnosis,
fault models, abduction, combinatorial optimization,
Bayesian model selection, etc. Different
approaches are apparently well suited for different
applications and representational forms in which
the system description is available. In this paper,
we provide a unifying theme behind all these approaches
based on the notion of model counting.
By doing this, we are able to provide a universal
characterization of diagnoses that is independent of
the representational form of the system description.
We also show how the shortcomings of previous approaches
(mostly associated with their inability to
reason about different elements of knowledge like
probabilities and constraints) are removed in our
framework. Finally, we report on the computational
tractability of diagnosis algorithms based on model
counting.

1  Introduction 

Diagnosis is an important component of autonomy for any
intelligent agent. Often, an intelligent agent plans a set of
actions to achieve certain goals and, because some conditions
may be unforeseen, it is important for it to be able to reconfigure
its plan depending upon the state in which it is. This
state identification problem is essentially a problem of diagnosis.
In its simplest form, the problem of diagnosis is to find
a suitable assignment to the modes of behavior of individual
components in a static system (given some observations made
on it). It is possible to handle the case of dynamic systems by
treating the transition variables as components (in one sense)
[Kurien and Nayak, 2000]. The theory developed in this paper
is therefore equally applicable to dynamic systems
(although we omit the discussion due to restrictions on the
length of the paper).

Many approaches have been used in the past to characterize
diagnoses and systems. Among the most comprehensive
pieces of work are [de Kleer and Williams, 1989], [Reiter,
1987], [Struss and Dressler, 1989], [Console et al., 1989],
[de Kleer et al., 1992], [Poole, 1994], [Kohlas et al., 1998]
and [Lucas, 2001]. The popular characterizations of diagnoses
include consistency-based diagnosis, fault models, abduction,
combinatorial optimization, and Bayesian model selection.
These approaches are, however, tailored for different
applications and representational forms in which the system
description is available. They also have one or more shortcomings
arising out of their inability to provide a framework
that can incorporate knowledge in different forms like
probabilities, constraints, etc.

In this paper, we provide a unifying theme behind all these
approaches based on the notion of model counting. By doing
this, we are able to provide a universal characterization of
diagnoses independent of the representational form of the system
description. Because model counting bridges the gap between
different kinds of knowledge elements, the shortcomings
of previous approaches are removed.

2  Background 

Before  we  present  our  characterization  of  diagnoses  based  on 
model  counting,  we  choose  to  provide  a  quick  overview  of 
the  previous  approaches  so  that  we  can  compare  and  contrast 
our  approach  with  them. 

Definition (Diagnosis System) A diagnosis system is a triple
(SD, COMPS, OBS) such that:

1. SD is a system description expressed in one of several
forms: constraint languages like propositional logic, probabilistic
models like Bayesian networks, etc. SD specifies both
component behavior information and component structure
information (i.e. the topology of the system).

2. COMPS is a finite set of components of the system. A
component comp_i (1 ≤ i ≤ |COMPS|) can behave in one
of several modes, drawn from a finite set M_i. If these modes are
not specified explicitly, then we assume two modes: failed
(AB(comp_i)) and normal (¬AB(comp_i)).

3. OBS is a set of observations expressed as variable values.

Definition The task in a complete diagnosis call is to find a
"suitable" assignment of modes to all the components in the
system given SD and OBS. The task in a partial diagnosis
call is to find a suitable assignment of modes to a specified
subset S (S ⊆ COMPS) of the components in the system
given SD and OBS.

Unless stated otherwise, we will use the term "diagnosis"
to refer to a complete diagnosis. Later in the paper we will
show that the characterization of partial diagnoses is a simple
extension of the characterization of complete diagnoses.

Definition (Candidate) Given a set of integers
$i_1 \cdots i_{|COMPS|}$ (such that for $1 \le j \le |COMPS|$,
$1 \le i_j \le |M_j|$), a candidate $Cand(i_1 \cdots i_{|COMPS|})$ is
defined as

$$Cand(i_1 \cdots i_{|COMPS|}) = \bigcup_{k=1}^{|COMPS|} (comp_k = M_k(i_k)).$$

Here, $M_u(v)$ denotes the $v$-th element in the set $M_u$ (assumed
to be indexed in some way).

Notation  When  the  indices  are  implicit  or  arbitrary,  we  will 
use  the  symbol  H  to  denote  a  candidate  or  a  hypothesis  i.e. 
an  assignment  of  modes  to  all  the  components  in  the  system. 

Consistency-Based  Diagnosis 

The task of consistency-based diagnosis can be summarized
as follows. Note that the definition of a diagnosis in this
framework does not discriminate between single and multiple
faults.

Definition (Consistency-Based Diagnosis) A candidate H is
a diagnosis if and only if SD ∪ OBS ∪ H is satisfiable.

There  are  other  characterizations  of  diagnoses  under  this 
framework  called  partial  diagnoses,  prime  diagnoses,  kernel 
diagnoses  etc.  We  will  examine  these  later  in  the  paper. 

Fault  Models 

Consider diagnosing a system consisting of three bulbs
B1, B2 and B3 connected in parallel to the same voltage
source V under the observations off(B1), off(B2)
and on(B3). AB(V) ∧ AB(B3) is a diagnosis under the
consistency-based formalization of diagnosis if we had
constraints only of the form ¬AB(B3) ∧ ¬AB(V) → on(B3).
Intuitively, however, it does not seem reasonable because B3
cannot be on without V working properly. One way to get
around this is to include fault models in the system. These are
constraints that explicitly describe the behavior of a component
when it is not in its nominal mode (the most expected mode
of behavior of a component). Such a constraint in this example
would be AB(B3) → off(B3). Diagnosis can become
indiscriminate without fault models. It is also easy to see
that the consistency-based approach can exploit fault models
(when they are specified) to produce more intuitive diagnoses
(like only B1 and B2 being abnormal).
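The three-bulb example can be checked by enumerating all candidates. The sketch below is an illustration under simplifying assumptions: every bulb state is observed, so satisfiability of SD ∪ OBS ∪ H reduces to evaluating the constraints, and the nominal constraint is applied to all three bulbs. The function and variable names are hypothetical.

```python
from itertools import product

COMPS = ["V", "B1", "B2", "B3"]
BULBS = ["B1", "B2", "B3"]
OBS = {"B1": False, "B2": False, "B3": True}   # True means the bulb is on

def consistent(ab, fault_models):
    """Check SD ∪ OBS ∪ H for the candidate H given by the dict `ab`."""
    for b in BULBS:
        # Nominal model: ¬AB(b) ∧ ¬AB(V) → on(b)
        if not ab[b] and not ab["V"] and not OBS[b]:
            return False
    # Fault model from the text: AB(B3) → off(B3)
    if fault_models and ab["B3"] and OBS["B3"]:
        return False
    return True

def diagnoses(fault_models):
    """All candidates (as sets of abnormal components) that are consistent."""
    result = []
    for modes in product([False, True], repeat=len(COMPS)):
        ab = dict(zip(COMPS, modes))
        if consistent(ab, fault_models):
            result.append(frozenset(c for c in COMPS if ab[c]))
    return result

without_fm = diagnoses(fault_models=False)
with_fm = diagnoses(fault_models=True)
```

Without the fault model, the candidate in which V and B3 are abnormal is accepted; adding AB(B3) → off(B3) removes it, while the candidate with only B1 and B2 abnormal remains a diagnosis in both cases.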

Diagnosis  as  Combinatorial  Optimization 

The technique of using fault models is associated with the
problem of being too restrictive. We may not be able to model
the case of some strange source of power making B3 on, etc.
The way out of this is to allow for many modes of behavior
for each component of the system. Every component has a
set of modes (in which it can behave) with associated models.
One of these is the nominal (or normal) mode and the
others are fault modes. Each component has the unknown
fault mode with the empty model. The unknown mode tries
to capture the modeling incompleteness assumption (obscure
modes that we cannot model in the system). Also, each mode
has an associated probability that is the prior probability of
the component being in that mode. Diagnosis can now be cast
as a combinatorial optimization problem of assigning modes
of behavior to each component such that it is not only consistent
with SD ∪ OBS, but also maximizes the product of
the prior probabilities associated with those modes (assuming
independence in the behavior of components).

Definition (Combinatorial Optimization) A candidate
$H = Cand(i_1 \cdots i_{|COMPS|})$ is a diagnosis if and only if
SD ∪ H ∪ OBS is satisfiable and

$$P(H) = \prod_{k=1}^{|COMPS|} P(comp_k = M_k(i_k))$$

is maximized.
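As a minimal sketch of this definition, the following code selects, among the consistent candidates of the three-bulb system, one that maximizes the product of prior mode probabilities. The prior values are invented for illustration, each component has only the two modes normal and abnormal, and a fault model AB(b) → off(b) is assumed for every bulb; none of these choices comes from the paper.

```python
from itertools import product
from math import prod

# Hypothetical prior probabilities of each component being abnormal.
PRIOR_AB = {"V": 1e-6, "B1": 0.01, "B2": 0.01, "B3": 0.01}
OBS = {"B1": False, "B2": False, "B3": True}   # True means the bulb is on

def consistent(ab):
    """SD ∪ H ∪ OBS is satisfiable (all bulb states are observed)."""
    for b, is_on in OBS.items():
        if not ab[b] and not ab["V"] and not is_on:
            return False           # nominal model ¬AB(b) ∧ ¬AB(V) → on(b)
        if ab[b] and is_on:
            return False           # assumed fault model AB(b) → off(b)
    return True

def prior(ab):
    """P(H) as a product of independent mode priors."""
    return prod(PRIOR_AB[c] if ab[c] else 1 - PRIOR_AB[c] for c in PRIOR_AB)

def best_candidate():
    comps = list(PRIOR_AB)
    candidates = (dict(zip(comps, m))
                  for m in product([False, True], repeat=len(comps)))
    return max((c for c in candidates if consistent(c)), key=prior)

best = best_candidate()
abnormal = {c for c in best if best[c]}
```

With these priors the optimum marks B1 and B2 as abnormal; a larger prior for a faulty voltage source would shift the optimum to V, which shows how strongly the result depends on the chosen probabilities.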

Diagnosis  as  Bayesian  Model  Selection 

Sometimes we have sufficient experience and statistical
information associated with the behavior of a system. In such
cases, the system description is usually available in the form
of a probabilistic model like a Bayesian network. Given some
observations made on the system, the problem of diagnosis
then becomes a Bayesian model selection problem.

Definition (Bayesian Model Selection) A candidate H
is a diagnosis (for a probabilistic model of the system,
SD) if and only if it maximizes the posterior probability
P(H | SD, OBS).

Diagnosis  as  Abduction 

Yet another intuition behind characterizing diagnoses is the
idea of explanation. Explanatory diagnoses essentially try to
capture the notion of cause and effect in the physics of the
system. The observations are asymmetrically divided into inputs
(I) and outputs (O) [de Kleer et al., 1992]. The inputs
(I) are those observation variables that can be controlled
externally.

Definition (Abductive Diagnosis): An abductive diagnosis
for (SD, COMPS, OBS = I ∪ O) is a candidate H such
that SD ∪ I ∪ H is satisfiable and SD ∪ I ∪ H → O.

3  Probabilities  and  Model  Counting 

Before  we  present  our  own  characterization  of  diagnoses 
based  on  the  notion  of  model  counting,  we  show  an  interest¬ 
ing  relationship  between  probabilities  and  model  counting 
(see  Figure  1).  The  model  counting  problem  is  the  problem 
of  counting  the  number  of  solutions  to  a  SAT  (satisfiability 
problem)  or  a  CSP  (constraint  satisfaction  problem). 

Definition  (Binary  representation  of  a  CPT):  The  bi¬ 
nary  representation  of  a  CPT  ( Conditional  Probability  Table ) 
is  a  table  in  which  all  the  floating-point  entries  of  the  CPT 
are  re-written  in  a  binary  form  (base  2)  up  to  a  precision  of  P 
binary  digits  and  the  decimal  point  along  with  any  redundant 
zeroes  to  the  left  of  it  are  removed. 

We  provide  a  set  of  definitions  and  results  relating  the 
probability  of  a  partial  assignment  A  to  the  number  of 
solutions  (under  the  same  partial  assignment  A)  to  CSPs 
composed  out  of  the  binary  representations  of  the  CPTs  (see 
Figure 1). Basic definitions related to CSPs can be found in [Dechter, 1992].

Definition (Zero-one-layer of a CPT) The kth zero-one-layer of a CPT is a table of zeroes and ones derived from the kth bit position of all the numbers in the binary representation of that CPT.

Figure 1: The conditional probability tables (CPTs) of a Bayes net are shown on the left of the vertical line L. On the right of L are the binary representations of these CPTs (example shown for 0.4 in decimal = 0.011 in binary). CPTs correspond to families in the Bayes net; let the number of families be C.

Definition (Weight of a zero-one-layer) The kth zero-one-layer of a CPT is defined to have weight $2^{-k}$.

Definition  (CSP  Compilation  of  a  CPT )  The  kth  CSP 
compilation  of  a  CPT  is  a  constraint  over  the  variables  of  the 
CPT  that  is  derived  from  the  kth  zero-one-layer  of  the  CPT 
such  that  zeroes  correspond  to  disallowed  tuples  and  ones 
correspond  to  allowed  tuples. 

Definition (CSP Compilation of Network) The $(k_1, k_2, \ldots, k_C)$ CSP compilation of the network is the set of constraints $S = \{s_i : s_i$ is the $k_i$th CSP compilation of the $i$th CPT$\}$.

Definition (Weight of a CSP Compilation) The weight of a $(k_1, k_2, \ldots, k_C)$ CSP compilation of a network is defined to be equal to $2^{-(k_1 + k_2 + \cdots + k_C)}$.

Property  There  are  an  exponential  number  of  CSP  compi¬ 
lations  for  a  given  network.  Since  each  CPT  expands  into 
P  zero-one-layers  and  a  CSP  for  the  entire  network  can  be 
compiled  by  taking  any  of  these  P  layers  for  each  CPT  (there 
are C CPTs), the total number of CSP compilations possible is $P^C$.
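
The definitions above can be illustrated with a minimal Python sketch (not the authors' implementation). It writes the entries of a single CPT in binary up to P digits, extracts the zero-one-layers with their weights, and enumerates the CSP compilations of a network. The CPT values, the precision P = 3, the number of families C = 2 and all function names are assumptions chosen for illustration; truncation rather than rounding is assumed because it agrees with the 0.4 = 0.011 example of Figure 1.

```python
from fractions import Fraction
from itertools import product

def binary_digits(value, precision):
    """First `precision` binary digits after the point, truncating the rest."""
    return "".join(str(int(value * 2**k) % 2) for k in range(1, precision + 1))

def zero_one_layer(cpt, k):
    """k-th zero-one-layer: the k-th binary digit of every CPT entry."""
    return {row: int(value * 2**k) % 2 for row, value in cpt.items()}

P = 3  # assumed precision in binary digits
# Hypothetical CPT of one binary variable X: P(X=0) = 0.4, P(X=1) = 0.6.
cpt = {(0,): Fraction(2, 5), (1,): Fraction(3, 5)}

digits = {row: binary_digits(v, P) for row, v in cpt.items()}
layers = {k: zero_one_layer(cpt, k) for k in range(1, P + 1)}
weights = {k: Fraction(1, 2**k) for k in range(1, P + 1)}  # layer k has weight 2^-k

C = 2  # assumed number of CPTs (families) in the network
# One compilation picks a layer index k_i for each of the C CPTs.
compilations = list(product(range(1, P + 1), repeat=C))
```

With these assumed values, each CPT contributes P layers, so `compilations` has P^C entries, in line with the Property above.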

Notation We will use the notation $h_{ij}$ to mean the jth CSP compilation of the ith CPT. Let A indicate a complete or partial assignment to the variables. If A is an assignment that instantiates all the variables of $CPT_i$, then we will use the notation $h_{ij}(A)$ to indicate whether or not A satisfies $h_{ij}$. If A is a complete assignment for all the variables in the network, then all variables for all CPTs are instantiated and we will use the notation $CSP_{(k_1, k_2, \ldots, k_C)}(A)$ to indicate whether A satisfies all the constraints $h_{i k_i}$ ($1 \le i \le C$). If A is not a complete assignment for all the variables, then we will use the notation $\#CSP_{(k_1, k_2, \ldots, k_C)}(A)$ to indicate the number of solutions to the $(k_1, k_2, \ldots, k_C)$ CSP compilation of the network that share the same partial assignment as A.

Theorem 1 The probability of a complete assignment $A = (X_1 = x_1, \ldots, X_n = x_n)$ is just the sum of the weights of the different CSP compilations of the network that are satisfied by this complete assignment. That is,

$P(A) = \sum_{(k_1, k_2, \ldots, k_C)} 2^{-(k_1 + k_2 + \cdots + k_C)} \, CSP_{(k_1, k_2, \ldots, k_C)}(A)$

(for all $1 \le i \le C$, $1 \le k_i \le P$).

Proof Consider the complete assignment $A = (X_1 = x_1, \ldots, X_n = x_n)$ for all the variables. The probability of this assignment is equal to the product of the probabilities defined locally by each CPT. Now using the fact that the jth bit in the binary representation of this local value has been written out as an allowed or disallowed tuple in the jth CSP compilation of that CPT, we can rewrite the local value for A in the ith CPT as $\sum_{j=1}^{P} h_{ij}(A) \, 2^{-j}$. The total probability is then just the product over all local values, $\prod_{i=1}^{C} \sum_{j=1}^{P} h_{ij}(A) \, 2^{-j}$. Expanding the product, we see that it is of the form

$\sum_{(k_1, k_2, \ldots, k_C)} 2^{-(k_1 + k_2 + \cdots + k_C)} \prod_{i=1}^{C} h_{i k_i}(A) = \sum_{(k_1, k_2, \ldots, k_C)} 2^{-(k_1 + k_2 + \cdots + k_C)} \, CSP_{(k_1, k_2, \ldots, k_C)}(A)$.
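
Theorem 1 can be checked numerically on a small case. The sketch below assumes a hypothetical two-variable network X1 → X2 whose CPT entries are exactly representable with P = 2 binary digits; the network, the values and the function names are illustrative assumptions, not taken from the paper. It compares the weighted sum over all CSP compilations with the usual product of local CPT values.

```python
from fractions import Fraction
from itertools import product

P = 2  # assumed precision; every entry below is exact in 2 binary digits

# Hypothetical network X1 -> X2. cpt1 is keyed by (x1,), cpt2 by (x1, x2).
cpt1 = {(0,): Fraction(1, 2), (1,): Fraction(1, 2)}
cpt2 = {(0, 0): Fraction(1, 4), (0, 1): Fraction(3, 4),
        (1, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

def bit(value, k):
    """k-th binary digit after the point (1 = allowed tuple, 0 = disallowed)."""
    return int(value * 2**k) % 2

def prob_by_compilations(x1, x2):
    """Sum of the weights of the (k1, k2) compilations satisfied by (x1, x2)."""
    total = Fraction(0)
    for k1, k2 in product(range(1, P + 1), repeat=2):
        satisfied = bit(cpt1[(x1,)], k1) * bit(cpt2[(x1, x2)], k2)
        total += satisfied * Fraction(1, 2 ** (k1 + k2))
    return total

def prob_direct(x1, x2):
    """Joint probability as the product of the local CPT values."""
    return cpt1[(x1,)] * cpt2[(x1, x2)]
```

For example, under these assumptions the assignment (X1 = 0, X2 = 1) satisfies the compilations (1, 1) and (1, 2), whose weights 1/4 and 1/8 add up to the joint probability 3/8.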


Theorem 2 (Model Counting) The marginalized probability of a partial assignment A to a set of variables $S \subseteq V$ is equal to the product of the weight and the number of solutions (under the same partial assignment A) summed over all CSP compilations of the network. That is,

$P(A) = \sum_{(k_1, k_2, \ldots, k_C)} \#CSP_{(k_1, k_2, \ldots, k_C)}(A) \, 2^{-(k_1 + k_2 + \cdots + k_C)}$

(for all $1 \le i \le C$, $1 \le k_i \le P$).

Proof From the previous theorem, we know that the probability of a complete assignment B is $\sum_{(k_1, k_2, \ldots, k_C)} CSP_{(k_1, k_2, \ldots, k_C)}(B) \, 2^{-(k_1 + k_2 + \cdots + k_C)}$ (for all $1 \le i \le C$, $1 \le k_i \le P$). Now, the marginalized probability of a partial assignment A is just the sum of the probabilities of all complete assignments B that agree with A on the assignment to variables in S. That is, $P(A) = \sum_{B} P(B) \, (B(S) = A)$. Using the result of the previous theorem to expand P(B), we have

$P(A) = \sum_{B} \sum_{(k_1, k_2, \ldots, k_C)} CSP_{(k_1, k_2, \ldots, k_C)}(B) \, 2^{-(k_1 + k_2 + \cdots + k_C)} \, (B(S) = A)$.

Switching the two summations and noting that, for each compilation, $\sum_{B} CSP_{(k_1, k_2, \ldots, k_C)}(B) \, (B(S) = A)$ is the same as $\#CSP_{(k_1, k_2, \ldots, k_C)}(A)$, we get that

$P(A) = \sum_{(k_1, k_2, \ldots, k_C)} \#CSP_{(k_1, k_2, \ldots, k_C)}(A) \, 2^{-(k_1 + k_2 + \cdots + k_C)}$.
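
Theorem 2 can be checked in the same spirit. The sketch below again assumes a hypothetical network X1 → X2 with dyadic CPT entries (an illustration, not an example from the paper) and computes the marginal of the partial assignment X2 = x2 in two ways: by weighting the number of solutions of every CSP compilation, and by ordinary marginalization.

```python
from fractions import Fraction
from itertools import product

P = 2  # assumed precision; every entry below is exact in 2 binary digits

# Hypothetical network X1 -> X2. cpt1 is keyed by (x1,), cpt2 by (x1, x2).
cpt1 = {(0,): Fraction(1, 2), (1,): Fraction(1, 2)}
cpt2 = {(0, 0): Fraction(1, 4), (0, 1): Fraction(3, 4),
        (1, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

def bit(value, k):
    """k-th binary digit after the point (1 = allowed tuple, 0 = disallowed)."""
    return int(value * 2**k) % 2

def count_solutions(k1, k2, x2):
    """#CSP_(k1,k2)(A): complete assignments extending X2 = x2 that satisfy
    the k1-th layer of cpt1 and the k2-th layer of cpt2."""
    return sum(bit(cpt1[(x1,)], k1) * bit(cpt2[(x1, x2)], k2) for x1 in (0, 1))

def marginal_by_counting(x2):
    """Weighted model count summed over all (k1, k2) compilations."""
    return sum(count_solutions(k1, k2, x2) * Fraction(1, 2 ** (k1 + k2))
               for k1, k2 in product(range(1, P + 1), repeat=2))

def marginal_direct(x2):
    """Ordinary marginalization over X1."""
    return sum(cpt1[(x1,)] * cpt2[(x1, x2)] for x1 in (0, 1))
```

Under these assumptions the compilation (1, 1) has two solutions with X2 = 1 and the compilation (1, 2) has one, giving 2/4 + 1/8 = 5/8, which is also the marginal obtained directly.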

3.1  Probability-Equivalents  and  Incorporation  of 
Probabilities 

Often,  we  are  given  information  in  many  forms.  Probabilities 
are  natural  information  elements  when  there  is  an  element  of 
statistical  experience  that  we  want  to  exploit.  In  other  cases, 
constraints  may  be  the  most  appropriate  to  use.  The  general 
idea in our framework is to use probabilities when we explicitly have them and to use model counting otherwise. We will use $\#(S_1, S_2, \ldots)$ to mean the number of consistent models to $(S_1 \cup S_2 \cup \cdots)$ (with respect to the uninstantiated free variables in SD). Theorems 1 and 2 establish that model counting is a weaker form of probabilities and that probabilities provide only precision information over model counting. Therefore, it is natural for us to use probabilities (to describe events) when we have them explicitly, and to use model counting otherwise. For any event E, we use the expressions $\#(SD, E) / \#(SD)$ and $P(E)$ almost equivalently, except that we use the former when we do not know $P(E)$ explicitly. This framework allows us to reason about both probabilities and constraints.

Definition  (Probability  Equivalents )  The  probability  equiv¬ 
alent  of  #(SD,E)  for  any  event  E  is  defined  to  be 
P(E)#(SD)  when  P(E)  is  given  explicitly. 

4  Diagnosis  as  Model  Counting 

In  this  section,  we  characterize  diagnoses  based  on  model 
counting.  We  will  then  show  how  all  the  previous  approaches 
are  captured  under  this  formalization.  For  the  first  part  of 
the  discussion  we  will  consider  only  complete  diagnoses  (an 
assignment  of  modes  for  all  the  components). 

Definition (Model Counting Characterization) A diagnosis is a candidate H that maximizes the number of consistent models to $SD \cup OBS \cup H$ using probability equivalents wherever necessary.

Notation We will use $M(H)$ to denote $\#(SD, OBS, H)$ (the number of consistent models to $SD \cup OBS \cup H$) when SD and OBS follow from context.

Theorem 3 (Capturing Consistency-Based Diagnosis) Consistency-Based diagnosis is looking for a hypothesis H for which $M(H)$ is non-zero.

Proof By definition, consistency-based diagnosis chooses H such that $SD \cup OBS \cup H$ is consistent. In other words, there exists at least one satisfying assignment for $SD \cup OBS \cup H$. Clearly, this is equivalent to saying that $M(H)$ is non-zero.
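
To make the notion of M(H) concrete, here is a minimal sketch on a hypothetical system that does not appear in the paper: two inverters in series, where an "ok" inverter negates its input and a "faulty" one is left unconstrained. The observations fix the input and the output, so the only free variable is the wire between the two components. The system, the mode names and the function names are all assumptions made for illustration.

```python
from itertools import product

MODES = ("ok", "faulty")  # assumed behavior modes of each inverter

def inverter_consistent(mode, inp, out):
    """An 'ok' inverter enforces out = not inp; a 'faulty' one allows anything."""
    return mode == "faulty" or out == 1 - inp

def model_count(hypothesis, obs_in, obs_out):
    """M(H): number of values of the unobserved wire `mid` that are
    consistent with the system description, the observations and H."""
    mode1, mode2 = hypothesis
    return sum(
        1
        for mid in (0, 1)
        if inverter_consistent(mode1, obs_in, mid)
        and inverter_consistent(mode2, mid, obs_out)
    )

# Observation: input 1, output 0 (two healthy inverters would output 1).
counts = {h: model_count(h, obs_in=1, obs_out=0) for h in product(MODES, repeat=2)}
consistency_based = [h for h, m in counts.items() if m > 0]
```

In this toy system the hypothesis that both inverters are ok has M(H) = 0 and is rejected, while the other three hypotheses have non-zero counts and are the consistency-based diagnoses. The sketch counts models only; it does not include the prior probabilities P(H) that the characterization brings in through probability equivalents.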
Theorem  4  (Capturing  Abduction)  Abduction  chooses  a 
hypothesis  H  that  maximizes  M(H)  assuming  uniformity  in 
prior  probabilities  P(H). 

Proof The maximum value of $\#(SD, OBS = I \cup O, H)$ is $\#(SD, H, I)$ and this happens when $H \cup SD \cup I \rightarrow O$. Given that the input variables are controlled externally, we know that $\#(SD, H) = N(I) \, \#(SD, H, I)$. Here, $N(I)$ is a constant that measures the number of different values for the input variables. Since $\#(SD, H)$ is equivalent to $P(H) \#(SD)$, which we assumed to be a constant for all H, maximizing $\#(SD, OBS, H)$ is equivalent to finding a hypothesis H for which $I \rightarrow O$ (under SD). The fact that abduction requires H to be consistent is also captured, because if H is inconsistent, then $M(H) = 0$ and clearly $M(H)$ will not be maximized.

Theorem  5  ( Capturing  Bayesian  Model  Selection )  Bayesian 
model  selection  chooses  a  hypothesis  H  such  that  it  maxi¬ 
mizes  the  probability  equivalent  of  M (H). 

Proof  The  probability  equivalent  of  M(H)  = 
#(SD,OBS,H)  is  P(OBS,H).  Clearly,  if  we  are 
maximizing  P(OBS,  H)  then  we  are  maximizing 
P(H/OBS)P(OBS).  Since  P(OBS)  is  independent 
of  H,  it  is  equivalent  to  maximizing  P(H/OBS)  which  is 
exactly  what  Bayesian  model  selection  does. 

Theorem  6  ( Capturing  Combinatorial  Optimization )  Com¬ 
binatorial  optimization  is  looking  for  a  hypothesis  H  which 
maximizes  P(H )  under  the  condition  that  M(H)  is  non¬ 
zero. 

Proof  As  noted  earlier,  H  is  consistent  with  SD  U  OBS 
if  and  only  if  M(H)  is  non-zero.  We  also  know  that 
combinatorial  optimization  is  looking  for  a  consistent  H 
which  maximizes  P(H).  The  theorem  follows  as  a  simple 
consequence  of  the  above  two  statements.  Basically,  combi¬ 
natorial  optimization  maximizes  only  the  prior  probabilities 
of  hypotheses  (instead  of  maximizing  the  equivalent  of  the 
posterior  probabilities)  unless  they  are  obviously  ruled  out 
by  being  inconsistent. 

4.1  Consequences  (Removing  Previous 
Shortcomings) 

We  now  show  the  consequences  of  formalizing  diagnosis  as 
model  counting.  In  particular,  we  identify  problems  with  pre¬ 
vious  approaches  and  show  how  model  counting  removes  all 
of  them. 

Problems  with  Consistency-Based  Diagnosis 

One  of  the  problems  with  consistency-based  diagnosis  is  that 
it  allows  for  non-intuitive  hypotheses  as  diagnoses.  It  pro¬ 
vides  only  for  a  necessary  but  not  a  sufficient  condition  on 
the  hypotheses  that  can  be  qualified  as  diagnoses.  By  itself,  it 
is  of  little  value  unless  we  use  an  elaborate  set  of  fault  models 


to  remove  non-intuitive  hypotheses  that  could  otherwise  be 
consistent.  Model  counting  removes  these  problems  because 
of  its  ability  to  merge  and  incorporate  the  notions  of  both 
consistency  and  probabilities.  In  one  sense,  one  can  think  of 
model  counting  as  giving  us  a  measure  of  the  degree  to  which 
a  hypothesis  is  consistent  with  SD  and  OBS.  Some  of  these 
problems  are  alternatively  addressed  in  [Kohlas  et  al.,  1998] 
and  [Lucas,  2001]. 

Problems  with  Fault  Models 

The  problem  with  fault-models  is  that  of  over-restriction  (as 
explained  at  the  beginning  of  the  paper).  We  need  to  be  able 
to  reason  not  only  about  constraints  relating  SD  and  OBS , 
but  also  about  any  other  kind  of  information  we  may  have 
in  the  form  of  probabilities  etc.  The  over-restriction  problem 
can  be  removed  by  introducing  probabilities.  These  proba¬ 
bilities  can  then  be  used  in  the  unified  framework  of  model 
counting. 

Problems  with  Abduction 

Like  the  consistency-based  approaches,  explanatory  diag¬ 
noses  are  also  unable  to  incorporate  and  reason  about  proba¬ 
bilities.  Yet  another  problem  with  abduction  is  that  it  assumes 
we  have  completely  modeled  all  cause-effect  relationships  in 
our  system.  This  contradicts  our  modeling  incompleteness 
assumption  and  is  an  unnecessary  restriction  on  SD.  Model 
counting  removes  this  problem  in  a  way  very  similar  to  how 
probabilities  were  used  to  deal  with  the  modeling  incomplete¬ 
ness  assumption.  Alternate  treatments  for  these  problems 
can  be  found  in  [Poole,  1994]  (which  links  abduction  with 
probabilistic  reasoning)  and  [Console  et  al.,  1989]  (which  ad¬ 
dresses  the  modeling  incompleteness  assumption). 

Problems  with  Diagnosis  as  Bayesian  Model  Selection 

Bayesian model selection agrees with our characterization of diagnoses, but the only problem it poses is that it requires SD to be in the form of a Bayesian network with known probabilities. Modeling a physical system as a Bayesian network
is  in  many  cases  a  non-intuitive  thing  to  do.  This  is  especially 
so  when  certain  probability  terms  are  hard  to  get.  Parts  of 
the  system  may  be  better  expressed  in  the  form  of  constraints 
or  automata.  In  such  cases,  Bayesian  model  selection  does 
not  extend  in  a  natural  way  and  model  counting  is  the  right 
substitute  (because  it  is  defined  under  all  frameworks). 

Problems  with  Diagnosis  as  Combinatorial  Optimization 

One  problem  associated  with  casting  diagnosis  as  a  combi¬ 
natorial  optimization  problem  is  that  of  being  unable  to  give 
explanatory diagnoses a preference over the rest. Clearly, we would like to prefer hypotheses that not only maximize the prior probability P(H) but that are also explanatory rather than just being consistent with $SD \cup OBS$. One way to incorporate this preference is to find all consistent hypotheses that maximize P(H) and to pick an explanatory one among them. The question that arises then is how we would compare two hypotheses one of which is explanatory and the other just consistent (but not explanatory), with the latter having a slightly better prior probability. This question is left unanswered under the combinatorial optimization formulation of diagnoses. In the model counting framework however, it is easy to see that we really have to maximize $P(H) \cdot \frac{\#(SD, OBS, H)}{\#(SD, H)}$. The second factor is maximized for explanatory diagnoses, but this is as much preference as we attach to them.

Another  problem  with  the  combinatorial  optimization  for¬ 
mulation  is  that  probabilities  are  restricted  to  only  behavior 
modes  of  components  and  only  these  prior  probabilities  are 
maximized.  There  is  no  framework  to  reason  about  proba¬ 
bilistic  information  connected  with  observation  variables. 

5  Partial  Diagnoses 

Sometimes,  we  are  interested  in  finding  a  suitable  assignment 
of  modes  to  a  specified  subset  S  of  the  components  COMPS 
rather  than  for  all  components.  We  argue  that  our  characteri¬ 
zation  of  diagnoses  under  the  model  counting  framework  re¬ 
mains  unchanged. 

Definition (Candidate) Given a set of integer tuples $(k_1, i_{k_1}), \ldots, (k_n, i_{k_n})$ such that for $1 \le j \le n \le |COMPS|$, $1 \le i_{k_j} \le |M_{k_j}|$, a candidate $Cand((k_1, i_{k_1}), \ldots, (k_n, i_{k_n}))$ is defined as $Cand((k_1, i_{k_1}), \ldots, (k_n, i_{k_n})) = \bigcup_{g=1}^{n} (comp_{k_g} = M_{k_g}(i_{k_g}))$.

Notation When the indices are implicit or arbitrary, we will use the notation $J_S$ to denote a candidate or a hypothesis, i.e. a set of mode assignments to all the components in $S \subseteq COMPS$.

Definition (Model Counting Characterization) A partial diagnosis for $S \subseteq COMPS$ is an assignment of modes $J_S$ to the components in S that maximizes $\#(SD, OBS, J_S)$ using probability equivalents wherever necessary.

It  is  now  not  hard  to  verify  that  all  previous  approaches  are 
captured  in  a  way  very  similar  to  that  for  complete  diagnoses. 
This  is  essentially  a  consequence  of  the  theorem  that  relates 
the  number  of  consistent  models  for  (SD,  OBS,  Js)  to  the 
marginalized  probability  of  Js  (Theorem  2).  Instead  of  pre¬ 
senting  the  proofs  again  (and  making  repetitive  arguments), 
we  choose  to  allude  to  another  set  of  characterizations  mostly 
associated  with  consistency-based  diagnosis.  These  are  the 
notions  of  partial  (a  different  characterization  in  consistency- 
based  diagnosis),  kernel  and  prime  diagnoses.  These  notions 
have  the  same  kind  of  drawbacks  associated  with  the  general 
consistency-based  framework  [de  Kleer  et  al.,  1992]  and  our 
investigation  into  these  notions  is  just  in  the  spirit  of  under¬ 
standing  their  relationship  to  model  counting. 

Definition An AB-literal is $AB(c)$ or $\neg AB(c)$ for some component c in COMPS. An AB-clause is a disjunction of AB-literals containing no complementary pair of AB-literals.

Definition A conflict of (SD, COMPS, OBS) is an AB-clause entailed by $SD \cup OBS$. A minimal conflict of (SD, COMPS, OBS) is a conflict no proper sub-clause of which is a conflict of (SD, COMPS, OBS).

Definition  ( Consistency-Based  Characterization )  The  partial 
diagnoses  of  (SD,  COMPS,  OBS)  are  the  implicants  of  the 
minimal  conflicts  of  (SD,  COMPS,  OBS). 

Theorem 7 A partial diagnosis in the consistency-based framework identifying an implicant T of the minimal conflicts of $SD \cup OBS$ is also a partial diagnosis in the model-counting framework maximizing $M(J_S) = \#(SD, OBS, J_S)$ for S = variables of the implicant T, but with free variables limited to abnormality (AB) variables.

Proof The implicant T fixes an assignment for the components in S but leaves $COMPS \setminus S$ unassigned. Let the set of minimal conflicts of $SD \cup OBS$ be $\pi$. Let $\#_{AB}(E)$ denote the number of consistent models of E restricted to free variables being from the uninstantiated AB-variables. Since T is an implicant of $\pi$, all models of T (restricted to AB-variables) also satisfy $\pi$ and are hence consistent with $SD \cup OBS$. This makes $\#_{AB}(SD, OBS, T) = \#_{AB}(T)$. In general, since $\#_{AB}(SD, OBS, T)$ is upper bounded by $\#_{AB}(T)$, the truth of the theorem follows.

Definition  ( Consistency-based  Characterization )  A  kernel 
diagnosis  identifies  the  prime  implicants  of  the  minimal  con¬ 
flicts  of  SD  U  OBS. 

Without  a  detailed  discussion  (due  to  lack  of  space),  we 
claim  that  this  notion  is  related  to  yet  another  task  in  diagno¬ 
sis  —  that  of  “representing”  complete  diagnoses.  This  task 
is  orthogonal  to  “characterizing”  them  [Kumar,  2002].  There 
are  other  notions  of  diagnosis  called  prime  diagnoses,  irre- 
dundant  diagnoses  etc.  [de  Kleer  et  al.,  1992]  arising  mostly 
out  of  the  task  of  “representation”  and  all  of  which  are  cap¬ 
tured  in  one  or  the  other  way  by  the  model  counting  frame¬ 
work  (which  we  omit  in  this  paper). 

6  Related  Work  on  Characterizing  Diagnoses 
and  Model  Counting 

Related  work  in  trying  to  unify  model-based  and  probabilis¬ 
tic  approaches  can  be  found  in  [Poole,  1994],  [Kohlas  et  al., 
1998],  [Lucas,  1998]  and  [Lucas,  2001].  [Poole,  1994]  links 
abductive  reasoning  and  Bayesian  networks  and  general  diag¬ 
nostic  reasoning  systems  with  assumption-based  reasoning. 
[Kohlas  et  al.,  1998]  shows  how  to  take  results  obtained  by 
consistency  based  reasoning  systems  into  account  when  com¬ 
puting  a  posterior  probability  distribution  conditioned  on  the 
observations  (the  independence  assumptions  are  lifted  in  [Lu¬ 
cas,  2001]).  [Lucas,  1998]  gives  a  semantic  analysis  of  differ¬ 
ent  diagnosis  systems  using  basic  set  theory.  The  issue  of  the 
modeling  incompleteness  assumption  is  referred  to  in  [Con¬ 
sole  et  al.,  1989]. 

Diagnosis  algorithms  based  on  model  counting  have  not 
yet  been  developed.  However,  the  problem  of  model  count¬ 
ing  itself  has  been  extensively  dealt  with.  Although  this  prob¬ 
lem  is  #P-complete,  there  are  a  variety  of  techniques  that 
have  been  used  to  make  it  feasible  in  practice  (including  ap¬ 
proximate  counting  algorithms  running  in  polynomial  time, 
structure-based  techniques  etc.).  Model  counting  for  a  SAT 
instance  in  DNF  (disjunctive  normal  form)  is  simpler  than  it  is 
for  CNF  (conjunctive  normal  form).  For  DNF,  there  is  a  fully 
polynomial  randomized  approximation  scheme  (FPRAS)  to 
estimate  the  number  of  solutions  [Karp  et  al.,  1989].  CDP  and 
DDP  are  two  model-counting  algorithms  for  SAT  instances  in 
CNF [Bayardo and Pehoushek, 2000]. A version of RELSAT
has  also  been  used  to  do  model  counting  on  SAT  instances  in 
CNF.  If  a  propositional  theory  is  in  a  special  form  called  the 
smooth,  deterministic,  decomposable,  negation  normal  form 
(sd-DNNF),  then  model  counting  can  be  made  tractable  and 
incremental  [Darwiche,  2001]. 


7  Summary  and  Future  Work 

In  this  paper,  we  provided  a  unifying  characterization  of  diag¬ 
noses  based  on  the  idea  of  model  counting.  In  the  process,  we 
compared  and  contrasted  our  formalization  with  the  previous 
approaches  —  in  many  cases,  removing  the  problems  asso¬ 
ciated  with  them.  Because  model  counting  bridges  the  gap 
between  probabilities  and  constraints  and  is  well-defined  for 
many  representational  forms  of  information  available  about 
the  system,  we  believe  that  the  model  counting  characteri¬ 
zation  of  diagnoses  is  useful  and  general  in  the  sense  of  not 
imposing  any  restrictions  on  the  representational  form  of  the 
system  description. 

As  for  our  future  work,  we  are  in  the  process  of  investi¬ 
gating  and  developing  computationally  tractable  algorithms 
based  on  the  model  counting  characterization  of  diagnoses. 
Advances  in  model  counting  algorithms  (approximate  count¬ 
ing,  structure-based  methods  etc.)  seem  to  be  encouraging 
towards  this  goal.  We  are  also  working  on  variants  of  the 
diagnosis  problem  (e.g.  when  we  are  interested  in  a  set  of 
candidate  hypotheses  rather  than  just  one). 

Computing Minimal Hitting Sets with Genetic Algorithm

Lin Li and Jiang Yunfei


Abstract. A set S that has a non-empty intersection with every set in a collection of sets C is called a hitting set of C. If no element can be removed from S without violating the hitting set property, S is considered to be minimal. Several interesting problems can be partly formulated as the problem of finding one or more minimal hitting sets. Many of these problems require exact solutions, but sometimes approximate solutions are enough. A genetic algorithm and advantaged algorithms were devised for computing minimal hitting sets. An improvement lets them obtain most minimal hitting sets efficiently. Furthermore, they are smaller, i.e. fewer rules.


1  INTRODUCTION 

Many theoretical and practical problems, e.g., [1-8], can be partly reduced to an instance of the minimal hitting set problem or one of its relatives, such as the minimum set cover problem, model-based diagnosis [1-5, 7-8], and the teachers and courses problem.

Generally speaking, it is a problem of selecting a minimal set (e.g., of teachers) that has a non-empty intersection with each set (e.g., list of courses). That is to say, for every course there is at least one selected teacher who can teach it. This is a formulation of the minimal hitting set problem, which, in general, is NP-hard [6].

Generally, there are a number of hitting sets, but sometimes we only need one or some of them. There are some algorithms [1-8] for computing all of the minimal hitting sets, but their space and time efficiency are not ideal. We present a novel method based on the Genetic Algorithm (GA for short) for calculating minimal hitting sets.

Definition 1. (Hitting sets)

Given a collection $C = \{S_i \mid i \in N\}$ of sets of elements from some universe U, a hitting set is a set $S \subseteq U$ such that $S \cap S_i \neq \emptyset$ for all i, i.e., a set which contains at least one element from each set in C. Let HS(C) denote the collection of all hitting sets of C, and let MHS(C) denote the collection of those hitting sets that have no proper subsets in HS(C). These are called the minimal hitting sets of C.

We introduce a minimizing operator $\mu$ [5], $MHS(C) = \mu(HS(C))$. We will use $\mu$ to get minimal conflict (or hitting) sets from conflict (or hitting) sets.

Determining a minimal cardinality element of MHS(C) is called the minimal hitting set problem.

Example 1. Model-based diagnosis [1], as shown in Figure 1. Suppose the conflict sets are {M1, M2, A1} and {M1, A1, A2, M3}. The minimal hitting sets (diagnoses) are {M1}, {A1}, {M2, A2}, {M2, M3}.

{M1} and {A1} are of minimal cardinality.
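
The result of Example 1 can be reproduced with a brute-force enumeration. This is only a reference sketch for small inputs, with assumed function names; it is exponential in the size of the universe and is not the genetic algorithm proposed in this paper. Candidates are enumerated by increasing size, and any candidate that contains an already found hitting set is skipped, so only minimal hitting sets are kept.

```python
from itertools import combinations

def is_hitting_set(candidate, collection):
    """True if the candidate shares at least one element with every set."""
    return all(candidate & s for s in collection)

def minimal_hitting_sets(collection):
    """Enumerate all minimal hitting sets by increasing cardinality."""
    universe = sorted(set().union(*collection))
    found = []
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            candidate = frozenset(combo)
            if any(m <= candidate for m in found):
                continue  # a smaller hitting set is contained in it: not minimal
            if is_hitting_set(candidate, collection):
                found.append(candidate)
    return found

conflicts = [frozenset({"M1", "M2", "A1"}), frozenset({"M1", "A1", "A2", "M3"})]
mhs = minimal_hitting_sets(conflicts)
smallest = min(len(m) for m in mhs)
minimum_cardinality = [m for m in mhs if len(m) == smallest]
```

For the two conflict sets of Example 1, `mhs` contains the four minimal hitting sets listed above, and `minimum_cardinality` contains {M1} and {A1}.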


Figure  1  A  simple  circuit  with  3  multipliers  and  2  adders. 

A minimal cardinality hitting set is a minimal hitting set of minimal cardinality.

For large collections of conflict sets, computing the hitting sets is costly in both time and space, as shown in Figure 2.

Real systems such as vehicles, computer systems, power plants and aircraft may contain millions of components. Therefore, we developed a novel, efficient GA to compute minimal hitting sets. Even when the number of conflict sets grows large, the GA method can still compute the minimal hitting sets in a very short time.


2  GENETIC  ALGORITHM 

A genetic algorithm is a heuristic for function optimization, where the extremum of the function (i.e., minimum or maximum) cannot be established analytically. A population of potential solutions is refined iteratively by employing a strategy inspired by Darwinian evolution or natural selection. Genetic algorithms promote “survival of the fittest”. This type of heuristic has been applied in many different fields, including construction of neural networks and multi-disorder diagnosis.

For the minimal hitting set problem, a straightforward choice of population is a set P of elements from $2^U$, encoded as bit-vectors, where each bit indicates the presence of a particular element in the set.

Example 2. (Teacher and course problem) Let C denote a set cluster containing

S1={1, 2, 3, 4}, S2={1, 2, 4}, S3={1, 2}, S4={2, 3}, S5={4}.

It means that there are 5 courses {S1, S2, S3, S4, S5} and 4 teachers 1, 2, 3, 4. Teachers 1, 2, 3 and 4 can teach course S1, teachers 1, 2, 4 can teach course S2, ... , teacher 4 can teach course S5. We want to find the fewest teachers who can teach all of the 5 courses. This is a minimal hitting set problem, and the minimal hitting sets are: H1={1, 3, 4}, H2={2, 4}.

We use bit-vectors to represent the sets and their hitting sets. These bit-vectors are called “chromosomes”, each bit is called a “gene”, and all of the “chromosomes” together are called the “population”.

If we use chromosomes to represent the sets, they are represented as follows:

S1={1, 1, 1, 1}, S2={1, 1, 0, 1}, S3={1, 1, 0, 0}, S4={0, 1, 1, 0}, S5={0, 0, 0, 1}.

The hitting sets are: H1={1, 0, 1, 1}, H2={0, 1, 0, 1}.
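
The chromosome encoding of Example 2 can be sketched as follows. The function names are assumptions made for illustration, and fitness is taken to be the number of sets a chromosome hits, as in the selection step of the algorithm.

```python
def encode(subset, n):
    """Bit-vector of length n: gene i is 1 if element i (1-based) is in the subset."""
    return [1 if e in subset else 0 for e in range(1, n + 1)]

def hits(chromosome, encoded_set):
    """True if the chromosome and the set share at least one element."""
    return any(a and b for a, b in zip(chromosome, encoded_set))

def fitness(chromosome, encoded_sets):
    """Number of sets in the cluster that the chromosome hits."""
    return sum(hits(chromosome, s) for s in encoded_sets)

courses = [{1, 2, 3, 4}, {1, 2, 4}, {1, 2}, {2, 3}, {4}]
n = 4  # number of teachers, i.e. the chromosome length
encoded = [encode(s, n) for s in courses]

h1 = encode({1, 3, 4}, n)
h2 = encode({2, 4}, n)
```

A chromosome is a hitting set exactly when its fitness equals the number of sets, which is 5 here for both `h1` and `h2`.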

Here, $|S_i| \le |\bigcup_i S_i|$ and $|H_i| \le |\bigcup_i S_i|$, so the length of the chromosomes equals $|\bigcup_i S_i|$.

Genetic  operations  include:  “crossover”,  “mutation”, 
“inversion”,  “selection”  and  “obtain”. 

Suppose that the cluster of minimal conflict sets is $C = \{S_1, S_2, \ldots, S_n\}$, $n = |\bigcup_i S_i|$.

“Crossover” operator. Suppose that $S_1 = \{s_{11}, s_{12}, \ldots, s_{1n}\}$ and $S_2 = \{s_{21}, s_{22}, \ldots, s_{2n}\}$ are two chromosomes. Select a random integer $0 < r < n$; $S_3$ and $S_4$ are the offspring of crossover($S_1$, $S_2$):

$S_3 = \{s_i \mid \text{if } i \le r,\ s_i \in S_1, \text{ else } s_i \in S_2\}$,

$S_4 = \{s_i \mid \text{if } i \le r,\ s_i \in S_2, \text{ else } s_i \in S_1\}$.

“Mutation” operator. Suppose that $S_1 = \{s_{11}, s_{12}, \ldots, s_{1n}\}$ is a chromosome. Selecting a random integer $0 < r \le n$, $S_3$ is the mutation of $S_1$:

$S_3 = \{s_i \mid \text{if } i \ne r \text{ then } s_i = s_{1i}, \text{ else } s_i = 1 - s_{1i}\}$.

“Inversion” operator. Suppose that $S_1 = \{s_{11}, s_{12}, \ldots, s_{1r}, s_{1,r+1}, \ldots, s_{1,r+l}, s_{1,r+l+1}, \ldots, s_{1n}\}$ is a chromosome, where r and l are random numbers. $S_2$ is the inversion of $S_1$:

$S_2 = \{s_{11}, s_{12}, \ldots, s_{1r}, s_{1,r+l}, \ldots, s_{1,r+1}, s_{1,r+l+1}, \ldots, s_{1n}\}$.

“Selection” operator. Suppose that there are m sets. We select [m/2] sets and eliminate the others; the sets we select are both “fit” and “minimal”, i.e. first, they intersect more sets than the others, and second, their cardinality is smaller.
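
The crossover, mutation and inversion operators can be sketched in Python as below. This is one possible reading of the definitions above, with assumed function names: the random positions r and l are passed in explicitly so that the behavior is reproducible, and inversion is read as reversing the l genes that follow position r.

```python
def crossover(s1, s2, r):
    """One-point crossover: the first r genes are exchanged between the parents."""
    return s1[:r] + s2[r:], s2[:r] + s1[r:]

def mutate(s, r):
    """Flip the gene at 1-based position r and keep all other genes."""
    return s[:r - 1] + [1 - s[r - 1]] + s[r:]

def inversion(s, r, l):
    """Reverse the order of the l genes that follow position r."""
    return s[:r] + s[r:r + l][::-1] + s[r + l:]

s3, s4 = crossover([1, 1, 0, 0], [0, 1, 1, 0], r=2)
mutated = mutate([1, 1, 0, 1], r=1)
inverted = inversion([1, 0, 0, 1, 1], r=1, l=3)
```

In a full run, r would be drawn at random, for instance with `random.randint(1, n - 1)` for crossover and `random.randint(1, n)` for mutation.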


“Obtain” operator. Suppose that there is a singleton set in the set cluster; then all hitting sets must hit this set, i.e. the gene that stands for the element of this set must always be kept as “1”. We refer to this operator as “obtain”.

The “obtain” operator has no influence on the result, but it can improve the efficiency, much as a giraffe obtains a “long neck”, so that chromosomes compete under the “long neck” condition.

Genetic algorithm.

1. InitializePopulation: Randomly generate a population of $k \cdot |C| \cdot |\bigcup_i S_i|$ chromosomes; each chromosome is an n-length array, and k is a constant coefficient.

2. Test whether one of the stopping criteria (time, fitness, etc.) holds. If so, the procedure stops; here, we stop after 100 generations.

3. Selection: Select chromosomes and test their fitness, here the number of sets each one hits. Keep the fittest ones and delete the bad ones.

4. Apply the genetic operators, such as “crossover”, “mutation”, “inversion” and “obtain”, to the selected parents to form offspring.

5. Recombine the offspring and the current population to form a new population with the “selection” operator.

6. Repeat steps 2-5.
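
The steps above can be sketched as a small Python program. This is a simplified illustration under stated assumptions, not the authors' test program: the population size, the fixed random seed, the omission of the inversion operator, the seeding of the population with the all-ones chromosome (so that at least one hitting set is always present) and the final greedy reduction to a minimal hitting set are all choices made here for the sake of a short, reproducible example.

```python
import random

def fitness(chrom, sets):
    """Number of sets hit; gene i set to 1 means element i + 1 is chosen."""
    return sum(any(chrom[e - 1] for e in s) for s in sets)

def obtain(chrom, sets):
    """Force the gene of every singleton set to 1 ("obtain" operator)."""
    forced = {next(iter(s)) for s in sets if len(s) == 1}
    return tuple(1 if i + 1 in forced else g for i, g in enumerate(chrom))

def select(population, sets, keep):
    """Keep the best chromosomes: more sets hit first, then fewer elements."""
    ranked = sorted(set(population), key=lambda c: (-fitness(c, sets), sum(c)))
    return ranked[:keep]

def reduce_to_minimal(chrom, sets):
    """Change 1 to 0 gene by gene while every set is still hit."""
    chrom = list(chrom)
    for i in range(len(chrom)):
        if chrom[i]:
            chrom[i] = 0
            if fitness(chrom, sets) < len(sets):
                chrom[i] = 1  # this gene is needed, restore it
    return tuple(chrom)

def ga_minimal_hitting_sets(sets, n, generations=100, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(n))
                  for _ in range(pop_size - 1)]
    population.append((1,) * n)  # assumption: start with the trivial hitting set
    population = [obtain(c, sets) for c in population]
    for _ in range(generations):
        offspring = []
        for _ in range(pop_size):
            a, b = rng.choice(population), rng.choice(population)
            r = rng.randint(1, n - 1)
            child = list(a[:r] + b[r:])   # one-point crossover
            j = rng.randrange(n)
            child[j] = 1 - child[j]       # single-gene mutation
            offspring.append(obtain(tuple(child), sets))
        # Keep half of the combined pool, as in the "selection" operator.
        population = select(population + offspring, sets, pop_size)
    hitting = [c for c in population if fitness(c, sets) == len(sets)]
    return {reduce_to_minimal(c, sets) for c in hitting}

courses = [{1, 2, 3, 4}, {1, 2, 4}, {1, 2}, {2, 3}, {4}]
result = ga_minimal_hitting_sets(courses, n=4)
```

Because selection ranks full hitting sets first and the all-ones chromosome is present from the start, the final population always contains at least one hitting set, and the greedy reduction guarantees that every returned chromosome is a minimal hitting set. Which of the minimal hitting sets of Example 2 are actually found depends on the random run.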

We can also use the genetic algorithm to compute MINIMAL
hitting sets from hitting sets.

In step 3, if a chromosome is a hitting set, we apply the mutation
operator only to change a gene from “1” to “0” in order to get its
offspring; otherwise, we apply the mutation operator only to change
a gene from “0” to “1”. In the next selection we go on keeping
the hitting sets because they are fitter.
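The directed mutation rule can be sketched in Python as below. The conflict sets used here are hypothetical (they are not the sets of Example 2), elements are numbered from 0, and the function names are chosen for this illustration only.

```python
from typing import List, Sequence, Set


def fitness(chromosome: Sequence[int], conflict_sets: List[Set[int]]) -> int:
    """Number of conflict sets hit by the elements whose gene is 1."""
    chosen = {i for i, gene in enumerate(chromosome) if gene == 1}
    return sum(1 for s in conflict_sets if s & chosen)


def is_hitting_set(chromosome: Sequence[int], conflict_sets: List[Set[int]]) -> bool:
    return fitness(chromosome, conflict_sets) == len(conflict_sets)


def directed_mutations(chromosome: Sequence[int], conflict_sets: List[Set[int]]) -> List[List[int]]:
    """All one-gene offspring: 1 -> 0 for a hitting set, 0 -> 1 otherwise."""
    flip_from = 1 if is_hitting_set(chromosome, conflict_sets) else 0
    offspring = []
    for i, gene in enumerate(chromosome):
        if gene == flip_from:
            child = list(chromosome)
            child[i] = 1 - gene
            offspring.append(child)
    return offspring


# Hypothetical conflict sets over elements 0..3.
sets = [{1, 2}, {2, 3}, {0, 1, 3}]
for child in directed_mutations([1, 1, 0, 1], sets):
    print(child, is_hitting_set(child, sets))
```

For this hypothetical input the parent is a hitting set, so only “1” genes are flipped; one of the three offspring is still a hitting set and the other two are not.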

In the end, we will get four kinds of sets, as follows:

1. Minimal hitting sets;

2. Both minimal hitting sets and their super-hitting sets; we
will use operator p to delete the super-hitting sets;

3. Hitting sets that are not minimal and whose sub-hitting sets
have not been obtained;

4. Sets that are not hitting sets; these will be deleted by the
“selection” operator.

In fact, however, situation 3 never arose in our GA test
program.

With GA we can get about 95 percent of the minimal hitting sets
(shown in Figure 2).

3  COMPARISON 

We have written a program to compare the HS-tree, the BHS-
tree [8] and GA; the results are shown in Figure 3 and Figure 4.
The elements of every conflict set are between 1 and 20.

In general, GA can get more than 95 percent of the minimal
hitting sets in 100 generations. When the set cluster is big, the
HS-tree and the BHS-tree cannot run because of “Out of memory”,
but GA can still get almost all minimal hitting sets efficiently.

The  space  complexity  of  HS-tree  is  about  0(m"),  m  is  the 
average  of  15,1,  n  is  Id,  That  of  BHS-tree  is  about  0(2  ^''' ), 


that  of  GA  is  about  0(«ID5',i).  So  the  efficiency  of  GA  is  better 
than  that  of  HS-tree  and  BHS-tree. 


[Figure: plot over the number of conflict sets; legend: BHS-tree, HS-tree, GA]


hitting  set. 

Example 3. (Continued from Example 2)

If we get H3={1, 1, 0, 1} and know that it is a hitting set,
then we apply the “mutation” operator to it; however, we only
change “1” into “0” here.

H3={1, 1, 0, 1} → {0, 1, 0, 1} (minimal hitting set)

→ {1, 0, 0, 1} (no hitting set)

→ {1, 1, 0, 0} (no hitting set)

Underlined genes stand for “mutation” from parent genes.

3. Although this algorithm cannot get all of the minimal hitting
sets, after we replace or repair the components we have computed,
we can carry out the next diagnosis step by step. Future research
will address the use of GA in the choice of a repair/replace action
on the set of suspects or the choice of a next measurement.

This GA can be used in many other fields; e.g. a librarian can
decide which journals referred to by researchers will be purchased
when funds are lacking [6, pp. 124].


Figure 2 Running time of BHS-tree, HS-tree and GA.
(CPU: PII 667, 128M main memory, C++, Windows 98)

[Figure: the comparison of BHS and GA; axis: the number of hitting sets]

Figure 3 The number of hitting sets and the percentage that GA
gets.

4  CONCLUSIONS 

In this paper, we have made the following improvements:

1. When the scale of the conflict sets gets big, this GA may get
most of the minimal hitting sets in a relatively short time and with
little memory, whereas the other algorithms cannot get the hitting
sets because of “out of memory”.

2. The GA can also get MINIMAL hitting sets. If a chromosome is
not a hitting set, the “mutation” operator just changes a random
gene from “0” into “1”; otherwise it changes a random gene from
“1” into “0”, so that we can get minimal hitting sets.

Abstract.

The infrastructure of modern society is controlled by software
systems that are vulnerable to attack. Successful attacks on these
systems can lead to catastrophic results; the survivability of such
information systems in the face of attacks is therefore an area of
extreme importance to society. This paper presents model-based
techniques for the diagnosis of potentially compromised software
systems; these techniques can be used to aid the self-diagnosis of
critical software systems and their recovery from failure. It
introduces Information Survivability as a new domain of application
for model-based diagnosis and presents new modeling and reasoning
techniques relevant to the domain. In particular: 1) We develop
techniques for the diagnosis of compromised software systems
(previous work on model-based diagnosis has been primarily
concerned with physical components); 2) We develop methods for
dealing with model-based diagnosis as a mixture of symbolic and
Bayesian inference; 3) We develop techniques for dealing with
common-mode failures; 4) We develop unified representational
techniques for reasoning about information attacks, the
vulnerabilities and compromises of computational resources, and the
observed behavior of computations; 5) We highlight additional
information that should be part of the goal of model-based
diagnosis.

1  Background  and  Motivation 

The infrastructure of modern society is controlled by computational
systems that are vulnerable to information attacks. The system and
application software of these systems possess vulnerabilities that
enable attacks capable of compromising the resources used by the
software systems. A skillful attack could lead to consequences as
dire as those of modern warfare. In every exercise conducted by the
government so far, the attacking team has managed to completely
compromise the target systems with little difficulty. There is a dire
need for new approaches to protect the computational infrastructure
from such attacks and to enable it to continue functioning even when
attacks have been successfully launched.

Our premise is that to protect the infrastructure we need to
restructure these software systems as Adaptive Survivable Systems. In
particular, we believe that a software system must be capable of
detecting its own malfunction and it must be capable of repairing
itself. But this means that it must first be able to diagnose the form
of the failure; in particular, it must both localize and characterize
the breakdown.

1 This article describes research conducted at the Artificial
Intelligence Laboratory of the Massachusetts Institute of Technology.
Support for this research was provided by the Information Systems
Office of the Defense Advanced Research Projects Agency (DARPA) under
Space and Naval Warfare Systems Center - San Diego Contract Number
N66001-00-C-8078. The views presented are those of the author alone
and do not represent the view of DARPA or SPAWAR.

Our work is set in the difficult context in which there is a
concerted and coordinated attack by a determined adversary. This
context places an extra burden on the diagnostic component. It is no
longer adequate merely to determine which component of a computation
has failed to achieve its goal; in addition, we wish to determine
whether that failure is indicative of a compromise to the underlying
infrastructure and whether that compromise is likely to lead to
failures of other computations at other times. Furthermore, we wish to
determine what kind of attack compromised the resource and whether
this attack is likely to have compromised other resources that share a
vulnerability. This paper focuses on the diagnostic component of
self-adaptive survivable systems.

2  Contributions  of  this  Work 

We build on previous work in Model-Based diagnosis [2, 3, 4, 5, 8].
However, the context of our research is significantly different from
that of the prior research, leading us to confront several important
issues that have not previously been addressed. In particular, we
present several new advances in representation and reasoning
techniques for model-based diagnosis:

1. We develop representation and reasoning techniques for describing
and reasoning about the behaviors and failures of software systems
(most previous work has focussed on hardware, particularly digital
hardware).

2. We develop mixed symbolic and Bayesian reasoning techniques for
model-based diagnosis. The statistical component of the technique
utilizes Bayesian networks to calculate accurate posterior
probabilities.

3. We develop a unified framework for reasoning about the failures
of the computations, about how these failures are related to
compromises of the underlying resources, about the vulnerabilities of
these resources and how these vulnerabilities enable attacks.

4. We develop techniques for reasoning about common-mode failures.
A common-mode failure occurs when the probabilities of the failure
modes of two or more components are not independent. This issue has
not been substantially addressed in the previous literature on
model-based diagnosis.

5. We develop diagnostic techniques that lead to an estimate of the
trustability of the computational resources that are used in a
specific computation. These techniques also help us to assess which
attacks have occurred and the likelihood that specific attacks have
been successful.

These are crucial issues when failure is caused by a concerted and
coordinated attack by a malicious opponent. There are many modes of
attacking computational systems but the most pernicious attackers seek
to avoid detection; therefore they attempt to scaffold the attack
slowly, at a nearly undetectable rate. These scaffolding actions will
typically appear as minor misbehaviors (i.e. they will cause the
system to behave somewhat outside its normal range), but skillful
attackers will space out the attacks so that the misbehaviors are
infrequent and they will attempt to make the resulting misbehaviors
seem as close to normal behavior as possible. This makes it crucial
that our diagnostic techniques be capable of extracting information
from low-frequency events that closely resemble normal modes of
operation.

Attackers  aim  at  high  leverage  points  of  the  infrastructure,  such 
as  operating  systems  or  middleware.  This  leads  to  common-mode 
faults,  because  once  the  operating  system  has  been  compromised  all 
application  components  can  be  caused  to  fail  simultaneously. 

The paper first briefly reviews the current state of the art in
model-based diagnosis; this work has mainly been concerned with
breakdowns caused by the deterioration of hardware components. In
particular, we adopt the framework in [4] where each component has
models for each of several behavioral modes and each model is given a
probability. We will then turn to the question of how to extend these
techniques so as to apply them to the diagnosis of software systems.
We extend our modeling framework to account for the fact that software
systems are built in layers of infrastructure, with compromises to one
layer affecting all higher levels. A software system has a great deal
of hidden state; what we are actually capable of observing is the
behavior of a specific computation; but this particular computation
uses a variety of resources (e.g. the operating system and middleware,
data-sets, etc.). These resources may have been subject to a variety
of compromises, each of which might lead to a different misbehavior of
the computation. Compromises to the resources occur because the
resources possess vulnerabilities that allow specific attacks to take
control of the resources for purposes other than those intended by the
original designers.

We will finally present mixed symbolic and statistical diagnostic
algorithms for assessing the posterior probabilities of the various
behavior modes of each component in the model. We present an
implementation and show an example of the reasoning process. Finally,
we discuss the demands placed on the diagnostic component by our goal
of self-adaptivity and conclude with suggestions for future research.

3  Related  Research 

Model-Based Diagnosis is a symptom-directed technique; it is driven
by the detection of discrepancies between the observations of actual
behavior and the predictions of a model of the system. Almost all of
the reported work in the area [2, 1, 3, 4, 5, 8] has been concerned
with the diagnosis of physical systems subject to routine breakdown.
Model-based diagnostic systems use simulation models that compute
expected outputs given known inputs; they utilize dependency-directed
techniques to link each intermediate and final value to the selected
behavioral model of any component of the system which was involved in
producing that value.


The  completeness  of  the  diagnostic  process  is  dependent  on  having 
bi-directional  simulation  models  for  each  component  of  the  system. 
Such  models  produce  both  a  set  of  assertions  recording  what  values 
are  expected  and  a  dependency  network  linking  these  assertions  to 
one  another  and  to  assertions  stating  which  components  must  be  in  a 
particular  behavioral  mode  for  those  values  to  appear. 

Our work builds on the framework in Sherlock [4] and on the
probabilistic techniques in [8]. In Sherlock the description of a
component includes multiple simulation models, one for each behavioral
mode of the component. One distinguished mode is the normal mode, but
behavioral models for known failure modes may also be provided. It is
also typical to include a null model to account for unknown modes of
behavior. Finally, each of the behavioral modes of a component is
assigned an a priori probability. Sherlock uses these to guide a
best-first search for a set of behavioral modes, one for each
component, such that the models for those modes predict the observed
behavior. This is the most likely diagnosis. However, these techniques
depend on the assumption that the failure modes of the components are
independent; as we will see, this assumption doesn’t hold in our
environment. Later work [8] introduced techniques for applying
Bayesian networks in the context of model-based diagnosis, allowing
dependencies to be modeled; [10] presents techniques within this
framework for generating several likely diagnoses in order of
decreasing likelihood.

Because our focus is on detecting the intentional compromise of
software components we are forced to face a number of new issues.
These include: How to model software components in the spirit of
model-based diagnosis; How to deal with the fact that a compromise
to the computational infrastructure (e.g. the operating system) can
manifest itself in the malfunction of many application components;
How to deal with the fact that compromised components may behave
in ways that are difficult to distinguish from normal behavior; How
to reason about the system so as to extract as much information about
possible compromises as we can. In particular, we deal with how to
use both symbolic and Bayesian techniques.

4  Modeling  Software  Computations 

Model-Based Diagnosis requires completely invertible models of the
components in order to guarantee completeness of its analysis. But
the components of a complex software system rarely have input-output
relationships that are invertible. We therefore look for other,
additional properties that lead to more complete coverage. In
particular, we concentrate here on descriptions of computational delay
(or other Quality Of Service metrics). In our current implementation
we use an interval of expected delay times (i.e. the computation
should run no slower than x and no faster than y) as the behavioral
models. Figure 1 shows the application of such models in a framework
similar to Sherlock. When propagating in the forward direction we
add the delay interval predicted by the behavioral model to the
interval bounding the arrival time of the latest input. In the
backward direction, we use interval subtraction (and only update the
bounds on the last input to arrive). When more than one component
predicts the bounds for a particular value (e.g. when a model for
component A and a model for component C both predict bounds for the
value labeled MID), we take the intersection of the two intervals to
obtain the tightest bounds implied by the overall model. A discrepancy
is detected when the lower bound of an interval exceeds the upper
bound.
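The interval operations described above can be illustrated with a short Python sketch. The class name Interval, the single-input chain of two components, and all numeric delays are assumptions made for this example; they are not the values of Figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        # Forward propagation: arrival-time bounds plus delay bounds.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other: "Interval") -> "Interval":
        # Backward propagation: standard interval subtraction.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def intersect(self, other: "Interval") -> "Interval":
        # Tightest bounds implied by two predictions of the same value.
        return Interval(max(self.lo, other.lo), min(self.hi, other.hi))

    def is_discrepancy(self) -> bool:
        # Lower bound exceeding upper bound signals a discrepancy.
        return self.lo > self.hi


# Hypothetical chain: input -> component A -> MID -> component B -> output.
arrival = Interval(0, 0)
delay_a = Interval(2, 4)
delay_b = Interval(1, 3)

mid_forward = arrival + delay_a          # bounds on MID predicted by A
out_predicted = mid_forward + delay_b    # bounds on the output
out_observed = Interval(9, 9)            # the output was seen at time 9

mid_backward = out_observed - delay_b    # bounds on MID implied by B
print(out_predicted.intersect(out_observed).is_discrepancy())  # prints True
print(mid_forward.intersect(mid_backward).is_discrepancy())    # prints True
```

With several inputs, the forward step would first take the bounds of the latest input to arrive, and the backward step would update only that input, as stated in the text.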

As in Sherlock we provide several behavioral models for each
component, one characterizing normal behavior, others characterizing
known failure modes and a null model to cover all other unexpected
behaviors.

Notice that in Figure 1 there are six potential diagnoses, only one
of which involves a single point of failure (in component C). The
others involve multiple failures with one component running slower
than expected and other components masking the fault at Out1 by
running faster than expected. In the third diagnosis, component A runs
in “negative time”! On the surface, such a diagnosis seems physically
impossible and we might expect the diagnostic algorithm to reject it.
But the diagnosis algorithm is guided by our representational
choices; the reason this diagnosis involves negative time is that the
fast behavioral model of component A predicts a delay interval from
-30 to +2.

Such behavior seems very unlikely, and indeed we assign a low
likelihood to this model; however, it is not impossible. Suppose that
both computations A and C are running on the same computer and
further suppose that the computer has been compromised by an
attacker. Under these circumstances, it’s not impossible for component
C to be delayed (because of a parasitic task inserted by the attacker)
while component A has been accelerated, running in less than zero
time because it has been hacked by the attacker to send out reasonable
answers before it receives its inputs.

What  we  are  able  to  observe  is  the  progress  of  a  computation; 
but  the  computation  is  itself  just  an  abstraction.  What  an  attacker 
can  actually  affect  is  something  physical:  the  file  representing  the 
stored  version  of  a  program,  the  bits  in  main  memory  representing  the 
running  program,  or  other  programs  (such  as  the  operating  system) 
whose  services  are  employed  by  the  monitored  application. 

Thus, we require a more elaborated modeling framework detailing
how the behavior of a computation is related to the state of the
resources that it uses. In turn, we must represent the vulnerabilities
of these resources and the attacks enabled by these vulnerabilities.
Finally, we must represent how such attacks compromise the resources,
causing them to behave in an undesired manner.

5  Common  Mode  Failures 

A single compromise of an operating system component, such as
the scheduler, can lead to anomalous behavior in several application
components. This is an example of a common mode failure; intuitively,
a common mode failure occurs when a single fault (e.g. an inaccurate
power supply) leads to faults at several observable points in the
system (e.g. several transistors misbehave because their biasing
power is incorrect). Another example comes from reliability studies
of nuclear power plants where it was observed that the catastrophic
failure of a turbine blade could sever several pipes as it flies off,
leading to multiple cooling fluid leaks.

Formally, there is a common mode failure whenever the probabilities
of the failure modes of two (or more) components are dependent.
Early model-based diagnostic systems have assumed probabilistic
independence of the behavior modes of different components [4] in
order to simplify the assessment of posterior probabilities. Later
work [8] allows for probabilistic dependence; however, it does not
explore in detail how to model the causes of this dependence. We deal
with common mode failures by extending our modeling framework to
make explicit the mechanisms that couple the failure probabilities of
different components.

We first extend our modeling framework, as shown in Figure 2,
to include two kinds of objects: computational components
(represented by a set of delay models, one for each behavioral mode)
and infrastructural components (represented by a set of modes, but no
delay or other behavioral models). Connecting these two kinds of
models are conditional probability links; each such link states how
likely a particular behavioral mode of a computational component would
be if the infrastructural component that supports that component were
in a particular one of its modes (normal or abnormal). Each
infrastructural component mode will usually project conditional
probability links to more than one computational component behavioral
mode, allowing us to say that normal behavior has some probability of
being exhibited even if the infrastructural component has been
compromised (however, for simplicity, Figure 2 shows only a one-to-one
mapping).

The  model  also  includes  a  priori  probabilities  for  the  modes  of 
the  infrastructural  components,  representing  our  best  estimates  of  the 
degree  of  compromise  in  each  such  piece  of  infrastructure.  Following 
a  session  of  diagnostic  reasoning,  these  probabilities  may  be  updated 
to  the  value  of  the  posterior  probabilities. 
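A small numeric sketch in Python may help show why a shared infrastructural component couples the failure probabilities of the computations it supports. The function name, the two modes (“normal”, “hacked”) and all probabilities below are hypothetical values chosen for illustration; the sketch also assumes that both computations have the same conditional probability of running slowly given the resource mode.

```python
from typing import Dict, Tuple


def shared_resource_failure_probs(
    p_hacked: float, p_slow_given: Dict[str, float]
) -> Tuple[float, float]:
    """Return (P(one computation is slow), P(both computations are slow))
    for two computations that depend on the same resource and are
    conditionally independent given the mode of that resource."""
    prior = {"normal": 1.0 - p_hacked, "hacked": p_hacked}
    p_one = sum(prior[m] * p_slow_given[m] for m in prior)
    p_both = sum(prior[m] * p_slow_given[m] ** 2 for m in prior)
    return p_one, p_both


# Hypothetical numbers: a priori 10% chance that the resource is hacked.
p_one, p_both = shared_resource_failure_probs(0.1, {"normal": 0.05, "hacked": 0.9})
print(p_one, p_both, p_one ** 2)
```

With these assumed numbers the joint probability of both computations being slow is several times larger than the product of the two marginal probabilities, which is exactly the dependence that defines a common mode failure.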

We next observe that resources are compromised by attacks. Attacks
are enabled by vulnerabilities in the resources. For example,
many systems in the Unix family are vulnerable to buffer-overflow
attacks; most networked systems are vulnerable to packet-flood
attacks. An attack is capable of compromising a resource in a variety
of ways; for example, buffer overflow attacks are used both to gain
control of a specific resource and to gain root access to the entire
system. But the variety of compromises enabled by an attack are not
equally likely (some are much more difficult than others). We
therefore add a third tier to our model to describe the ensemble of
attacks assumed to be available in the environment. We connect the
attack layer to the resource layer with conditional probability links
that state the likelihood of each mode of the compromised resource
once the attack has been successful.

Our  model  of  the  computational  environment  therefore  includes: 

•  The components of the computation that is being observed.

•  A set of behavioral models for each component, representing both
normal and failure modes.

•  The set of resources available to be used by the computational
components.

•  A set of behavioral modes for each resource, representing both
normal and compromised modes.

•  A map stating which resources are used by each computational
component.

•  Conditional probabilities linking the modes of the computations to
the modes of the resources employed by that component.

•  A list of vulnerabilities possessed by each computational resource.

•  A description of which attacks are enabled by each vulnerability.

•  A list of attack types that are believed to be active in the
environment.

•  A description of which compromised modes of each type of
resource can be caused by a successful execution of each type of
attack. This is provided as a set of conditional probabilities of the
compromised mode given the execution of the attack.

Given this information, simple rule-based inferencing (implemented
in the Joshua inference system) deduces which specific resources
might have been compromised and with what probability. This
information is then used to construct a Bayesian network (in the
IDEAL system).

6  Diagnostic  Reasoning 

Figure 3 shows a model of a fictitious distributed financial system
which we use to illustrate the reasoning process. The system consists
of five interconnected software modules (Web-server, Dollar-Monitor,
Bond-Trader, Yen-Monitor, Currency-Trader) utilizing four underlying
computational resources (WallSt-Server, JPMorgan, BondRUs,
Trader-Joe).

For each computational component we show the conditional
probability tables that describe how the behavioral modes of that
component probabilistically depend on the modes of the underlying
resources (each resource has two modes, normal and hacked).
Note that two computations (Dollar-Monitor and Yen-Monitor) are
supported by a common resource (JPMorgan) and compromises to
this underlying resource are likely to affect both computations. The
failure modes of these two computations are no longer independent;
this is indicated by the conditional probabilities connecting the
behavior modes of JPMorgan to those of both Dollar-Monitor and
Yen-Monitor. The specific conditional probabilities supplied describe
the degree of coupling.

Finally we show the a priori probabilities for the modes of the
underlying resources. However, when attacks are present in the
environment what matters is the conditional probabilities of the
different modes of the resources given that an attack has taken place.
We hypothesize that one or more attack types are present in the
environment, leading to a three-tiered model as shown in Figure 4. In
this example, we show two attack types, buffer-overflow and
packet-flood. Packet-floods can affect each of the resources because
they are all networked systems; buffer-overflows affect only the two
resources which are modeled as instances of a system type vulnerable
to such attacks.
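The link from attack types to the resources they might compromise can be sketched as a simple rule in Python. This is only an illustration, not the Joshua rule base: the vulnerability labels and the choice of which two resources are vulnerable to buffer overflows are assumptions, since the text does not name them, and the function name is hypothetical.

```python
from typing import Dict, List, Set


def possibly_compromised(
    resource_vulnerabilities: Dict[str, Set[str]],
    attack_exploits: Dict[str, str],
    active_attacks: List[str],
) -> Dict[str, List[str]]:
    """Map each resource to the active attacks enabled by one of its vulnerabilities."""
    result: Dict[str, List[str]] = {}
    for resource, vulns in resource_vulnerabilities.items():
        attacks = sorted(a for a in active_attacks if attack_exploits[a] in vulns)
        if attacks:
            result[resource] = attacks
    return result


# Assumption: JPMorgan and Trader-Joe are the two buffer-overflow-prone systems.
resources = {
    "WallSt-Server": {"networked"},
    "JPMorgan": {"networked", "unchecked-buffer"},
    "BondRUs": {"networked"},
    "Trader-Joe": {"networked", "unchecked-buffer"},
}
exploits = {"packet-flood": "networked", "buffer-overflow": "unchecked-buffer"}

for name, attacks in possibly_compromised(
    resources, exploits, ["buffer-overflow", "packet-flood"]
).items():
    print(name, attacks)
```

In the full model each such link would additionally carry the conditional probability of each compromised mode given a successful attack.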

As in earlier techniques, diagnosis is initiated when a discrepancy
is detected; in this case this means that the predicted production time
of an output differs from that actually observed after an input has
been presented. The goal of the diagnostic process is to infer as much
as possible about where the computation failed (so that we may
recover from the failure) and about what parts of the infrastructure
may be compromised (so that we can avoid using them again until
corrective action is taken). We are therefore looking for two things:
the most likely explanation(s) of the observed discrepancies and
updated probabilities for the modes of the infrastructural components.

To do this we use techniques similar to [4, 8]. We first identify all
conflict sets, and then proceed to calculate the posterior
probabilities of the modes of each of the computational components. We
do these tasks by a mixture of symbolic and Bayesian techniques;
symbolic model-based reasoning is used to predict the behavior of the
system, given an assumed set of behavioral modes. Whenever the
symbolic reasoning process discovers a conflict (an incompatible set
of behavioral modes), it adds to the Bayesian network a new node
corresponding to the conflict (see below). Bayesian techniques are
then used to solve the extended network to get updated probabilities.

This approach involves an exhaustive enumeration of the
combinations of the models of the computational components. This
allows us to calculate the exact posterior probabilities. However, this
is expensive and the precision may not be needed. It would be possible
to instead use the techniques in [10] to generate only the most likely
diagnoses and to use these to estimate the posterior probabilities; but
we have not yet pursued this approach.

Instead, we take the following approach: we alternate the finding
of conflicts with the search for diagnoses. After each “conflict”
node is added to the Bayesian network (see below) the network is
solved; this gives us updated probabilities for each behavioral mode
of each component. We can, therefore, examine the behavioral modes
in the current conflict and pick that component whose current
behavioral mode is least likely. We discard this mode, and pick the
most likely alternative; we continue this process of detecting
conflicts, discarding the least likely model in the conflict and
picking its most likely alternative until a consistent set is found.
This process is a good heuristic for finding the most likely
diagnosis 2.

Our models of computational behavior (the delay models) are used
to predict the behavior of the computational components and to
compare the predictions with observations. When a discrepancy is
detected, we use dependency tracing to find the conflict set underlying
the discrepancy (i.e. a set of behavioral modes which are
inconsistent). At this point a new (binary truth value) node is added
to the Bayesian network representing the conflict as shown in Figure 5.
This node has an incoming arc from every node that participates in
the conflict. It has a conditional probability table corresponding to a
pure “logical and”, i.e. its true state has a probability of 1.0 if all
the incoming nodes are in their true states and it otherwise has
probability 1.0 of being in its false state.

Since this node represents a logical contradiction, it is pinned in
its false state. Adding this node to the network imposes a logical
constraint on the probabilistic Bayesian network: the conflict
discovered by the symbolic, model-based behavioral simulation is
impossible. We continue to explore other combinations of behavioral
modes until all minimal conflicts are discovered. Each of these
conflicts extends the Bayesian network as before. The set of such
conflicts constitutes the full set of logical constraints on the values
taken on within the Bayesian network; thus, once we have augmented the
Bayesian network with a node for each conflict, the network contains
all the available information (see footnote 3).
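
The effect of pinning conflict nodes can be illustrated with a minimal Python sketch. It is not the authors' implementation (which is in Common Lisp and uses the Ideal system); the component names, modes, and prior values are invented, and the priors are assumed to be independent, whereas in the paper they are linked through the resource and attack tiers. Inference is done by brute-force enumeration rather than by a Bayesian network solver.

```python
from itertools import product

# Hypothetical priors over behavioral modes (not taken from the paper).
PRIORS = {
    "A": {"normal": 0.9, "slow": 0.1},
    "B": {"normal": 0.8, "slow": 0.2},
}


def posterior_modes(priors, conflicts):
    """Return P(component = mode | every conflict node is false).

    Each conflict is a dict {component: mode}. Its node is the logical AND
    of its parents, so pinning it false removes every joint assignment
    that contains the whole conflict.
    """
    names = list(priors)
    marginals = {c: {m: 0.0 for m in priors[c]} for c in names}
    total = 0.0
    for modes in product(*(priors[c] for c in names)):
        assignment = dict(zip(names, modes))
        if any(all(assignment[c] == m for c, m in k.items()) for k in conflicts):
            continue  # the conflict node would be true, which is ruled out
        weight = 1.0
        for c, m in assignment.items():
            weight *= priors[c][m]
        total += weight
        for c, m in assignment.items():
            marginals[c][m] += weight
    return {c: {m: w / total for m, w in ms.items()} for c, ms in marginals.items()}


# Assumed conflict: A and B cannot both be in their normal mode.
post = posterior_modes(PRIORS, [{"A": "normal", "B": "normal"}])
```

In this toy setting the conflict shifts probability mass toward the slow modes of both components, with the larger shift going to the component whose prior for the slow mode was higher.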

At this point, we have found all the minimal conflicts and added a
conflict node to the Bayesian network for each. We therefore also know
all the possible diagnoses, since these are the sets of behavioral
modes (one for each component) that are not supersets of any conflict
set. For each diagnosis we create a node in the Bayesian network that
is the logical-and of the nodes corresponding to the behavioral modes
of the components. This node represents the probability of this
particular diagnosis. The Bayesian network is then solved. This gives
us updated probabilities for all possible diagnoses, for the behavioral
modes of the computational components, and for the modes of the
underlying infrastructural components. Furthermore, these updated
probabilities are consistent with all the constraints we can obtain
from the behavioral models. Thus, they represent as complete an
assessment as is possible of the state of compromise in the
infrastructure. These posterior estimates can be taken as priors in
further diagnostic tasks, and they can also be used as a “trust model”
that informs users of the system (including self-adaptive computations)
of the trustworthiness of the various pieces of infrastructure that
they will need to use.
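
The characterization of diagnoses as full mode assignments that contain no conflict set can be sketched as follows. The components, modes, and conflicts are hypothetical, and exhaustive enumeration is used only for clarity; it would not scale to realistic systems.

```python
from itertools import product


def diagnoses(modes, conflicts):
    """Yield every full mode assignment that is not a superset of a conflict."""
    names = sorted(modes)
    for choice in product(*(modes[c] for c in names)):
        candidate = set(zip(names, choice))
        if not any(conflict <= candidate for conflict in conflicts):
            yield dict(zip(names, choice))


# Hypothetical components and minimal conflicts (not from the paper).
MODES = {
    "A": ["normal", "slow"],
    "B": ["normal", "slow"],
    "C": ["normal", "slow"],
}
CONFLICTS = [
    frozenset({("A", "normal"), ("B", "normal")}),
    frozenset({("B", "slow"), ("C", "normal")}),
]

found = list(diagnoses(MODES, CONFLICTS))
```

Each surviving assignment would correspond to one diagnosis node in the network; its probability is obtained only after the network is solved.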

2 However, since the probabilities of the failure modes of different
components are not independent, this is only a heuristic.

3 [8] builds logical reasoning directly into the Bayesian network system
because the logical inferences needed are simple enough to be
accommodated. However, our inference needs are more complex and not
easily amenable to this approach.

7 Results

The sample system shown in Figure 3 was run through several analyses,
including both those in which the outputs are within the expected range
and those in which the outputs are unexpected. Figure 6 shows the
results of an analysis in which the outputs are within the expected
range. Figures 7 and 8 show the results of an analysis of an abnormal
case. Inputs are supplied at times 10 and 15 for the two inputs of
Web-Server; each of the figures shows the times at which the outputs of
Currency-Trader and Bond-Trader are observed. There are four runs for
each case, each with a different attack model. In the first, it is
assumed that no attacks are present and the a priori values are used
for the probabilities of the different modes of each resource. The
second run assumes only a buffer-overflow attack; the third run assumes
only a packet-flood attack. The fourth run assumes both types of
attacks. There are four columns in each of the results charts, one for
each of these runs. The top chart in each figure shows the a priori and
posterior probabilities for each resource being in its “hacked” mode.
The middle chart shows the posterior probabilities for each mode of
each computational component. The bottom chart in each figure shows
the posterior probabilities that each of the two types of attacks has
occurred (see footnote 4).

There are more than two dozen possible diagnoses in the abnormal case.
It should be noted that even the most likely diagnosis is actually not
all that likely; in addition, the next several diagnoses are nearly as
likely. The most likely diagnosis is therefore not particularly
informative for our two goals of recovering from the failure and
steering away from compromised resources in the future. However, the
posterior probabilities of the modes of the infrastructure components
are, in fact, useful guides for the second of these goals. The
posterior probabilities of the behavioral modes of the computational
resources are useful guides for the first goal, because these
probabilities aggregate the information contained in the individual
diagnoses.

The most significant change is the increase in the probabilities that
the resources named JPMorgan and Wallst-server are hacked. This changes
the trustworthiness ordering of the resources: JPMorgan is a posteriori
the least trustworthy resource, while the a priori listing ranks
Trader-Joe followed by Bonds-R-US as the least trustworthy. This
follows from the fact that the JPMorgan resource is used by the
computations Yen-Monitor and Dollar-Monitor, both of which are very
likely to be in abnormal modes; the most likely explanation is that
JPMorgan causes a common-mode failure. Notice that in the last two
columns, when packet-flood attacks are possible, all the resources are
much more likely to be hacked. Qualitatively, this is because all the
resources are vulnerable to the packet-flood attack. The misbehavior of
the computational components provides evidence that JPMorgan is hacked,
which in turn provides evidence of a packet-flood attack. But since
packet-flood attacks affect all the resources, this increases the
likelihood that other resources are hacked as well. The Bayesian
network carries out the quantitative version of this argument.

4 The implementation is in Common Lisp and uses the Joshua [7]
rule-based reasoning system as well as the Ideal system [9], in
particular its implementation of the algorithm described in [6]. On a
300 MHz PowerBook, the total solution time is under 1 minute. By far
the most expensive part of this is calculating the probabilities of the
complete set of diagnoses. The most likely diagnosis and all conflict
sets are located in less than 10 seconds.

It is worth noting that this propagation of trust can carry over to
resources not used in the misbehaving computation. For example, assume
that the environment contains another resource (call it “Newbie”) that
is subject to the same attacks as the ones (e.g., JPMorgan) that
participated in the faulty computation. The misbehavior in the
computation is evidence that JPMorgan is “hacked” and this, in turn, is
evidence that an attack succeeded. But this would lend weight to the
conclusion that other resources (e.g., Newbie) subject to this same
attack had also been compromised. The Bayesian network would propagate
probabilities in exactly this fashion, leading to a posterior
assessment that Newbie has been hacked (although this probability will
be lower than the probability that JPMorgan is hacked).
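
A small numeric sketch may make this propagation concrete. All probabilities below are assumptions chosen for illustration, the network has a single attack node, and the observed misbehavior is assumed to depend only on whether JPMorgan is hacked; the network used in the paper is larger.

```python
from itertools import product

# All numbers are illustrative assumptions, not values from the paper.
P_ATTACK = 0.1                          # prior that the attack succeeded
P_HACKED = {True: 0.8, False: 0.01}     # P(resource hacked | attack succeeded?)
P_MISBEHAVE = {True: 0.9, False: 0.1}   # P(misbehavior | JPMorgan hacked?)


def bernoulli(p, value):
    """Probability of a boolean outcome under success probability p."""
    return p if value else 1.0 - p


def posteriors_given_misbehavior():
    """Enumerate attack/JPMorgan/Newbie states, conditioned on misbehavior."""
    total = 0.0
    hits = {"attack": 0.0, "jpmorgan": 0.0, "newbie": 0.0}
    for attack, jpmorgan, newbie in product([True, False], repeat=3):
        weight = (
            bernoulli(P_ATTACK, attack)
            * bernoulli(P_HACKED[attack], jpmorgan)
            * bernoulli(P_HACKED[attack], newbie)
            * P_MISBEHAVE[jpmorgan]          # evidence: misbehavior observed
        )
        total += weight
        for name, value in (("attack", attack), ("jpmorgan", jpmorgan), ("newbie", newbie)):
            if value:
                hits[name] += weight
    return {name: w / total for name, w in hits.items()}


post = posteriors_given_misbehavior()
prior_newbie = P_ATTACK * P_HACKED[True] + (1 - P_ATTACK) * P_HACKED[False]
```

Under these assumed numbers the posterior for Newbie rises above its prior but stays below the posterior for JPMorgan, which is the qualitative behavior described above.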

8  Conclusions  and  Future  Work 

The example above illustrates how model-based reasoning techniques can
be used to extract information from a single run. Our example is
intentionally fanciful, since at present we are concentrating on the
development of the representational and reasoning frameworks. In future
work we will explore realistic models of real systems.

The information extracted is probabilistic, and it sheds light on where
the computation might have failed, on what underlying resources might
have been compromised, and on what attacks might have succeeded.

It is notable that the identification of the most likely diagnosis is
not particularly informative. For example, in the most likely diagnosis
Yen-Monitor is in its Normal mode. However, the most likely behavioral
mode for Yen-Monitor is its Slower mode, which occurs in many of the
remaining diagnoses. The posterior probabilities of the behavioral
modes aggregate the probabilities from each of the possible diagnoses,
producing an overall assessment that is more informative than any
individual diagnosis. Of course, if there are very few diagnoses, or
the most likely diagnosis is extremely probable, then the probabilities
of its behavioral modes will approximate the overall posterior
probabilities.
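
The following sketch shows how the mode marginals can disagree with the most likely diagnosis. The four diagnoses and their probabilities are invented for illustration, as is the "much-slower" mode name; only the component names and the Normal and Slower modes come from the example above.

```python
from collections import defaultdict

# Hypothetical posterior probabilities of four complete diagnoses.
DIAGNOSES = [
    ({"Yen-Monitor": "normal", "Dollar-Monitor": "slower"}, 0.30),
    ({"Yen-Monitor": "slower", "Dollar-Monitor": "normal"}, 0.25),
    ({"Yen-Monitor": "slower", "Dollar-Monitor": "slower"}, 0.25),
    ({"Yen-Monitor": "slower", "Dollar-Monitor": "much-slower"}, 0.20),
]


def mode_marginals(diagnoses):
    """Sum diagnosis probabilities per (component, mode) pair."""
    marginals = defaultdict(lambda: defaultdict(float))
    for assignment, probability in diagnoses:
        for component, mode in assignment.items():
            marginals[component][mode] += probability
    return marginals


# The single most likely diagnosis puts Yen-Monitor in its normal mode ...
best_diagnosis, _ = max(DIAGNOSES, key=lambda d: d[1])

# ... but aggregated over all diagnoses its most likely mode is "slower".
marginals = mode_marginals(DIAGNOSES)
yen = marginals["Yen-Monitor"]
yen_mode = max(yen, key=yen.get)
```

A recovery planner that consulted only `best_diagnosis` would treat Yen-Monitor as healthy, while the aggregated marginal suggests it most likely was not.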

It is important to keep in mind why we are interested in the diagnoses
at all. The goal of the system is to recover from the failure and to
steer away from future trouble. To do this it needs to know how much of
the computation has been completed successfully and how much remains to
be done. Given such information, the system would pick a rollback point
for recovery that includes no failed part of the computation.
Furthermore, the chosen rollback point would maximize the probability
of continuing the restarted computation to completion. As we just saw,
an individual diagnosis, even the most likely diagnosis, does not give
us the information we need to do this. When the available evidence
supports multiple diagnostic hypotheses, our interest should shift from
individual diagnoses to aggregate failure probabilities, and this
information is conveyed completely by the posterior probabilities of
the failure modes. That is, if the posterior probability that
Yen-Monitor failed is high, then we don’t actually care that there are
multiple (multiple-point-of-failure) diagnoses involving this failure,
nor do we care how likely each of these diagnoses is. Instead, what we
do care about is that it is very likely that Yen-Monitor didn’t do its
job and that we should select a rollback point prior to its execution.
Similarly, in choosing a recovery plan we should avoid using those
resources whose posterior failure probabilities are highest (see
footnote 5).

5 Of course, gathering further evidence might reduce the number of
possible diagnoses, leading to greater resolution. However, in our
context there are two difficulties with attempting to do this. First,
it would take time, and there might be tight timeliness constraints on
the failed computation (e.g., suppose the computation was processing
sensor data which must be acted on very quickly). Second, any attempt
to gather more data would involve running the same, or similar,
computations again when we know that something is compromised; this
might lead to loss or destruction of data. Making this tradeoff
correctly involves estimating the expected cost of new information and
its expected benefit. It is possible that such an analysis would
suggest that acting on the available diagnostic evidence is the best
course of action.

This is to say that the goal of the diagnostic process should be to
assess the overall posterior probabilities of the behavioral modes of
the computational and infrastructure components. These give us evidence
for which computational resources are to be trusted during the recovery
process and during subsequent computations. This is a different
definition of the goal of diagnostic activity than has been used in
previous research on model-based diagnosis.

We have not yet addressed the details of how the system should use this
information in forming a recovery plan. The general outline is that
when assigning a computation to a resource it should choose the
resource that is most likely to be in a mode that will successfully
complete the computation. But the probabilities of the modes of
different resources are not independent; they are linked by the
Bayesian network. Having decided to use a particular resource because
it is likely to be in an acceptable mode, the system should pin the
Bayesian network into a state where the resource is believed to be in
the desired state and re-solve the network. Subsequent choices should
be made in light of the updated probabilities.

We have also not yet addressed the question of what actions the system
might take to obtain more information in future runs. The Minimum
Entropy approach in [3] provides a useful framework. However, the
current context provides more degrees of freedom: in addition to making
new observations, we can also change the assignment of resources to
computational components in a way that will maximize the expected gain
in information. The details of this remain for future research.

Figure 2. Modeling Computational and Infrastructure Components

Figure 4. An Example of the Three Tiered System Modeling Framework

Figure 5. Adding a Conflict Node to the Bayesian Network

Abstract

This paper summarizes the work done in the course of the Jade project,
which deals with automatic debugging of Java programs. Besides a brief
introduction to the Jade project, models developed to debug Java
programs are evaluated and results are presented. Furthermore, insights
gained from the results are discussed and topics for further research
are identified.

1  Introduction 

For the last three years the Jade project has examined the
applicability of model-based diagnosis (MBD) techniques to the software
debugging domain. In particular, the goals of Jade were (1) to
establish a general theory of model-based software debugging with a
focus on object-oriented programming languages, (2) to describe the
semantics of the Java programming language in terms of logical models
usable for diagnosis, and (3) to develop an intelligent debugging
environment for Java programs based on the theoretical results.

The main practical achievement of the Jade project is the interactive
debugging environment, which allows us to efficiently locate bugs in
faulty Java programs. Currently, this debugger is fully functional with
regard to nearly all aspects of the Java programming language and comes
complete with a user-friendly GUI, the diagnosis system being
integrated into a “normal” interactive debugger interface. The Jade
debugger limits the search space of bug candidates by computing
diagnoses for a given (incorrect) input/output behavior. This is done
by using model-based diagnosis techniques, which in some cases have
been adapted to suit the needs of an object-oriented debugging
environment. Furthermore, the debugger can be used to unambiguously
locate faults through an interactive debugging process, which is based
on the iterative computation of diagnoses, measurement selection, and
input of additional observations by the user.

This paper is organized as follows: The next section briefly describes
the program models used by the Jade debugging environment. Section 3
presents results obtained with the models introduced in Section 2.
Section 4 analyzes the results from Section 3 and discusses some
properties of the models. In Section 5, we point out interesting topics
for further research. Section 6 briefly compares our approach to
related work. Finally, we conclude the paper.

2  Program  models 

Since model-based diagnosis relies on the existence of a logical model
description of the underlying target system, the models are among the
most important components of the Jade system. Currently, the Jade
debugger makes use of two model classes, dependency-based models and
value-based models. This section briefly describes these model types.
More comprehensive descriptions can be found in [Stumptner et al.,
2001; Wieland, 2001; Mayer, 2000; 2001].

Dependency-based models are based on the collection of all data and
control dependencies of a given Java program. As an example, we look at
a single statement $S_i$, e.g., int x=a*b;. Informally, the variable
dependencies arising from this statement can be specified by
$S_i : x \leftarrow \{a, b\}$. A formal logical model can now
automatically be derived from this dependency. For our example it reads
$\neg AB(S_i) \wedge ok(a) \wedge ok(b) \Rightarrow ok(x)$, where the
predicate $AB$ stands for the assumption that a certain statement is
incorrect, i.e., behaves abnormally. The predicate $ok(v)$ specifies
that the value of variable $v$ is correct without making use of the
concrete value of $v$. Observations for such a model can be expressed
by specifying the correctness or incorrectness of a certain variable,
e.g., $\neg ok(x)$ in the above example. In the course of the Jade
project different dependency-based models have been created that vary
in their levels of abstraction and the amount of information used
during their creation. These models are:

ETFDM: A dependency-based model, which makes use of
a concrete execution trace [Wieland, 2001].

DFDM: A dependency-based model, which only makes use
of static (compile-time) information, such as the Java
source code and the programming language semantics
[Stumptner et al., 2001; Wieland, 2001].

SFDM: Another dependency-based model, which is based
on either the ETFDM or the DFDM and involves a
higher level of abstraction by removing the distinction
between object locations and references [Stumptner et
al., 2001; Wieland, 2001].

Table 1: Diagnosis and debugging results of the dependency-based models
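
As an illustration of dependency-based diagnosis, the Python sketch below encodes rules of the form "if a statement is not abnormal and all variables it depends on are ok, then its target is ok" for a four-statement straight-line program, and tests each single-fault assumption for consistency. Only the first statement (x = a*b) comes from the text; statements S2 to S4, the observations, and all function names are assumptions. The sketch ignores control dependencies, loops, and object references, which the Jade models handle.

```python
# statement -> (target variable, variables it depends on)
# S1 mirrors `int x = a*b;` from the text; S2-S4 are hypothetical.
STATEMENTS = {
    "S1": ("x", {"a", "b"}),
    "S2": ("y", {"c"}),
    "S3": ("z", {"x", "y"}),
    "S4": ("w", {"y"}),
}


def consistent(abnormal, ok_inputs, not_ok):
    """Check whether assuming `abnormal` statements faulty avoids a contradiction.

    ok(...) facts are propagated forward through all statements assumed
    correct; a contradiction arises if a variable observed as incorrect
    is derived to be ok.
    """
    derived = set(ok_inputs)
    changed = True
    while changed:
        changed = False
        for stmt, (target, deps) in STATEMENTS.items():
            if stmt not in abnormal and deps <= derived and target not in derived:
                derived.add(target)
                changed = True
    return not (derived & not_ok)


def single_fault_diagnoses(ok_inputs, not_ok):
    """Return the statements that can explain the observations on their own."""
    return [s for s in STATEMENTS if consistent({s}, ok_inputs, not_ok)]


# Inputs a, b, c are assumed correct; the value of z is observed to be wrong.
candidates = single_fault_diagnoses({"a", "b", "c"}, {"z"})
```

Here S4 is exonerated because z does not depend on it, while S1, S2, and S3 all remain candidates: without concrete values, the model cannot distinguish between them.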

Value-based models are models which make use of concrete execution
values and propagate these values from the model’s inputs to its
outputs and (if possible) from the model’s outputs to its inputs. A
simple value-based model for the above example statement reads
$\neg AB(S_i) \Rightarrow x = a * b$, where $x$, $a$, and $b$ stand for
concrete variable values as computed at run-time. In the case of
value-based models observations can be expressed by specifying the
concrete value of a certain variable, e.g., $x = 6$ in the above
example. The Jade system currently operates on the following two
value-based model types:

VBM: A value-based model, which makes use of not only
the underlying program dependencies, but also concrete
evaluation values and the full programming language
semantics [Mayer, 2000].

LF-VBM: A second value-based model, which is based
on the unfolded source code for a particular program
run [Mayer, 2001]. In particular, the loops are expanded
into a set of nested conditional statements, where the
conditional statements are modeled specially in order to
provide better backward reasoning capabilities.
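
The additional discrimination that concrete values provide can be sketched in Python as follows. This is not the Jade constraint model: a statement assumed abnormal may produce any value from a small, assumed finite domain, which emulates forward and backward propagation by brute-force search. Apart from x = a*b, the program, the input values, and the observations are hypothetical.

```python
# (statement, target, function computing the target from the environment)
# S1 mirrors `int x = a*b;` from the text; S2-S4 are hypothetical.
PROGRAM = [
    ("S1", "x", lambda env: env["a"] * env["b"]),
    ("S2", "y", lambda env: env["c"] + 1),
    ("S3", "z", lambda env: env["x"] + env["y"]),
    ("S4", "w", lambda env: env["y"] * 2),
]

# Assumed finite range of values a faulty statement might produce.
DOMAIN = range(-50, 51)


def explains(abnormal, inputs, observed):
    """True if some output value of the abnormal statement matches all observations."""
    for guess in DOMAIN:
        env = dict(inputs)
        for stmt, target, compute in PROGRAM:
            # The abnormal statement is unconstrained; all others behave as written.
            env[target] = guess if stmt == abnormal else compute(env)
        if all(env[var] == value for var, value in observed.items()):
            return True
    return False


def single_fault_diagnoses(inputs, observed):
    """Return the statements that can explain the observations on their own."""
    return [stmt for stmt, _, _ in PROGRAM if explains(stmt, inputs, observed)]


# With a=2, b=3, c=4 the correct program yields z=11 and w=10.
# Observed: z is 10 (wrong) and w is 10 (as expected).
candidates = single_fault_diagnoses({"a": 2, "b": 3, "c": 4}, {"z": 10, "w": 10})
```

With the same program structure, a purely dependency-based model would also keep S2 as a candidate; here the observed value of w rules it out, since any value of y that repairs z would change w.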

Although the expressiveness of the individual models is not exactly the
same, all models support a considerable subset of the Java programming
language. Currently, exception handling and programs using multiple
threads are not supported. Furthermore, the value-based models do not
support recursive method calls. The models are designed to locate
functional faults, e.g., wrong operators or reversed conditions. They
cannot reliably locate structural faults or more severe defects, such
as wrong algorithms or data structures.

3  Results 

In this section we describe results obtained by applying the models
introduced above to a set of example programs and compare them with
respect to their debugging and diagnostic accuracy. The tests were
separated into two test sets: one was used to compare the
dependency-based models, and the other was used to evaluate the
value-based models. A comparison between the dependency-based models
and the value-based models can be found in [Stumptner et al., 2001].
Most of the example programs can be obtained from
http://www.dbai.tuwien.ac.at/proj/Jade/.

3.1  Dependency-based  models 

The first test series evaluates the performance of the dependency-based
models, i.e., DFDMs, ETFDMs, and SFDMs, and compares the results scored
by these model types. The test series has two main goals. (1) To
examine the ability of the Jade debugger to reduce the search space of
bug candidates. In other words, we test which parts of a Java program
can automatically be excluded from the fault localization process in a
single diagnosis step and which parts of the search space remain for
further debugging actions. (2) To evaluate the debugging performance of
the Jade tool, i.e., to determine the amount of user interaction needed
to unambiguously locate a fault in a Java program.

In order to carry out these tests we implement several test programs
demonstrating simple variable dependencies (simulating a binary adder,
numeric examples), making use of control structures (if and while
statements), and finally multiple objects and instance fields together
with linked lists and general processing (a small library application).
We then construct test cases for each program P by specifying the
correct input/output behavior of P and installing a single fault into
P. Overall, 52 test cases are constructed and used for the evaluation
of the system’s performance. Table 1 shows all tests carried out, with
each row summarizing all tests performed in a single test series.
Column #TC denotes the number of tests of the respective test series.

The diagnostic performance of the Jade system in the context of
dependency-based models is given in columns 4 to 8 of Table 1. Column
$\varnothing S_1$ shows the average number of top-level statements of
the tested programs in a single test series. Since the Jade tool
performs hierarchical debugging, only these top-level statements (this
excludes statements nested in loops and selection statements) can be
identified as the source of a fault in a single diagnosis step. Columns
$\varnothing D_1$ and $\varnothing D_2$ present the number of top-level
statements which remain as possible fault candidates after a single
diagnosis step has been performed using DFDMs and ETFDMs, respectively.
In other words, the difference between $\varnothing S_1$ and
$\varnothing D_1$ ($\varnothing D_2$) shows the number of statements
which can be eliminated from the debugging scope in a single diagnosis
step. Columns $\varnothing D_1(\%)$ and $\varnothing D_2(\%)$ show the
number of remaining statements for both model types in relation to the
total number of top-level statements, i.e., $\varnothing S_1$. These
columns present the percentage of statements which remain as possible
fault candidates for further debugging actions. All tests are also
performed with the simplified versions of the test programs’ DFDMs. In
cases where these tests yield results different from tests with the
full DFDMs, the results obtained from the SFDMs are given in brackets.
Note that no tests are carried out with simplified versions of ETFDMs,
since these models are not yet fully supported by the Jade debugging
tool.

The right side of Table 1 (columns 9 to 14) depicts the debugging
performance of the Jade debugging environment. Since we are now
interested in the exact localization of faults, we no longer deal with
top-level statements only, but also take statements nested in loop and
selection statements into consideration. Column $\varnothing S_2$ shows
the average number of all statements of the respective tested program.
Column $\varnothing R$ gives the average indices of those statements in
which the single fault has been installed during the test design phase.
If we argue that with traditional debugging tools one has to step
through the code manually statement by statement until the bug is
located, the values in column $\varnothing R$ provide a reasonable
reference value for the amount of user interaction needed by the Jade
system to exactly locate a fault. The latter is presented in columns
$\varnothing T_1$ (DFDMs) and $\varnothing T_2$ (ETFDMs). Columns
$\varnothing X_1(\%)$ and $\varnothing X_2(\%)$ show the average number
of user interactions relative to the average index of the buggy
statement, i.e., $\varnothing X_1(\%) = \varnothing T_1 / \varnothing R$
and $\varnothing X_2(\%) = \varnothing T_2 / \varnothing R$.

3.2  Value-based  models 

In a second step we test the diagnostic performance of the more
detailed and semantically stronger value-based models, i.e., VBMs and
LF-VBMs. For this task we implement a second set of example programs,
which is designed especially to investigate the specific advantages and
disadvantages of the value-based model variants. Whereas some examples
are small and specifically designed to demonstrate different aspects of
the models, most of the example programs implement well-known
algorithms which could be part of larger programs. For example,
programs executing a binary search procedure, computing the Huffman
encoding of an array of characters, or applying Gauss elimination are
part of this test suite. Similar to the tests carried out with the
dependency-based models, faults were seeded into each program such that
each test case is influenced by one fault. Again, we assume that the
faulty program is a close variant of the correct program. We do not
deal with wrong choice of algorithms, data structures, or similar major
design defects.

The diagnostic experiments are performed by specifying the inputs of
the program together with the expected results as observations. A
summary report of the obtained results for each example program is
given in Table 2. Several aspects of the examples are listed: Stm
denotes the number of statements in the program, and C represents the
number of components in the generated model. D stands for the number of
diagnoses of minimal cardinality that are obtained, and H represents
the number of diagnoses from D that actually include the seeded fault.
S denotes the cardinality at which the diagnostic process is stopped
because the seeded fault has been located. Finally, the %-column lists
the percentage of the statements that have to be examined in the worst
case until the seeded fault is found. Here it is assumed that the
diagnoses are presented with increasing cardinality. Note that these
numbers can further be improved by suitable heuristics, which present
the diagnoses according to their likelihood of explaining the faults.
For the VBM, the columns H and S are omitted because their value is
always equal to one. Numbers in parentheses denote cases where the
faults cannot be located because the maximum time allowed for diagnosis
is exceeded. In these cases the numbers are lower bounds on the actual
results that would be obtained when continuing the diagnostic process
to its completion.

                         ---- VBM ----     -------- LF-VBM ---------
Program          Stm     C     D     %     C     D     H     S     %
BinSearch         27    16     6    63    43     1     1     2     8
Binomial          76    26     9    42   255    24     1     1    32
BoundedSum        16    14     4    38    19     1     0     2    38
BubbleSort        15    10     6    93    34     7     1     1    47
FindPair           5     4     4   100    10     1     0     2    80
FindPositive2     17    13     3    41    20     2     1     1    12
FindPositive3     17    13     3    41    20     2     1     1    12
Hamming           27    19    11    70    95     9     1     1    33
Huffman           64    22     9    80   161     9     0   (2)  (25)
Huffman           64    22     6    59   164    12     1     1    19
Intersection      95    31    12    84   155     8     1     1     5
Library           24    21     6    38    36     5     0     2    34
Matrix            71    21    21   100   127    37     1     1    52
MaxSearch2        21    16     3    38    37     2     0     2    19
MultLoops         21    12     2    19    27     4     2     3    24
Multiset          97    55     8    28   283     1     0   (2)  (11)
Permutation       24    17    14    96    29     3     1     1    13
Permutation0      26    19    12    69    33     1     1     1     4
Permutation1      26    19    12    69    32     8     0     3   100
Permutation2      26    19    15    85    33     9     1     1    35
Permutation3      24    19    12    67    33     2     0     3    50
Polynom          120    64    14    24   189    26     0   (3)  (13)
SearchTree        84    41    41   100   140    45     0   (1)  (54)
SkipEqual          5     4     4   100    11     2     1     1    40
Stat              23    17     3    39    42     2     0     4    48
Sum                5     4     3    80    10     3     1     1    40
SumPowers         21    12     8    81    36     5     1     1    24
∅                 39    20     9    65    77     8   0.6 (1.6)  (32)

Table 2: Diagnosis results of the value-based models

4  Discussion 

Based on the results from Section 3, in this section we discuss some
important properties of the proposed models and present insights gained
during the Jade project.


From the results it can be seen that the amount of code that has to be
analyzed in order to locate a fault can be reduced significantly with
all models. Table 1 shows that in the test series carried out with
dependency-based models approximately 40% of the top-level statements
can be eliminated from the debugging scope, leaving some 60% for further
debugging actions. Interestingly, the average results obtained with the
different dependency-based model types were quite similar, with slight
advantages for ETFDMs (in comparison to DFDMs) and for full model
versions (in comparison to SFDMs). In the case of value-based models,
the results are of the same order of magnitude. In particular, between
40% and 80% of all statements have to be checked, with the average being
65%. Note that this does not indicate a better performance of
dependency-based models in comparison to value-based models, since
completely different test programs were used to evaluate the different
model types. In particular, the test series with the value-based
variants generally used longer and more complex test methods. With
dependency-based models these methods result in only very few statements
being removed from the suspect code, whereas VBMs still yield remarkable
results. For a more detailed comparison of dependency-based and
value-based models see [Stumptner et al., 2001].

Dependency-based models  One major advantage of dependency-based models
is that they can be constructed and applied to actual diagnosis problems
very quickly. This also holds for medium- to large-size programs. They
are also easier to handle than their value-based counterparts, since
they require observations only to state whether the value of a certain
variable is correct or not, whereas value-based models need concrete
execution values. Generally, the use of ETFDMs results in fewer single
diagnoses, because concrete execution traces are used during the
collection of the dependencies. This becomes especially apparent for
programs that include loop and selection statements or recursive method
calls. The improved debugging performance of ETFDMs in comparison to
DFDMs comes at the cost of longer modeling times, since the creation of
a model now depends not only on the underlying source code, but also on
the existence of an execution trace, whose creation requires running the
program. It was also shown that the full versions of DFDMs and ETFDMs
are superior to their simplified counterparts. This is because they
model object locations and object references by separate model
constructs and thus provide a finer-grained model architecture. On the
other hand, the computation of diagnoses with full model versions is
computationally more expensive. Furthermore, the specification of
observations is easier with simplified model versions.

The Value-Based Model  However, dependency-based models did not prove to
be an optimal solution for all tested programs due to their lack of
run-time information. Note that even ETFDMs do not make use of concrete
evaluation values directly, but only extract information about executed
branches and numbers of loop iterations from concrete execution traces.
Therefore, the VBM was developed, which makes use of the full
programming language semantics and propagates concrete evaluation values
through the system. As already mentioned, in many cases VBMs achieve
satisfactory results for programs that can hardly be diagnosed using
dependency-based approaches alone. [Stumptner et al., 2001] indicates
that in general value-based models are superior to their
dependency-based counterparts. Therefore, despite their high
computational requirements, VBMs have proved to be satisfactory
general-purpose alternatives and complements to dependency-based models.

Loop Handling  A negative aspect of the dependency-based models and the
VBM is that these models provide good results for programs without loops
but fail to compute satisfactory diagnoses for programs that consist of
large loop statements. This is because loop statements are modeled
hierarchically, so that discrimination between statements inside the
loops is not possible. To overcome these problems, the LF-VBM expands
loops into a set of nested conditional statements, with separate
assumption variables for each statement. The number of conditional
statements is derived from the initial execution of the test cases.
Therefore, the model is able to reason about the statements inside the
loop independently, without considering the whole loop as a single
entity. This provides a finer-grained resolution, which avoids the
problem of large diagnosis entities mentioned above.
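
The loop expansion described above can be pictured as a source-to-source transformation. The following sketch is only meant to convey the idea: the helper name `unroll_while`, the Java-like output syntax, and the textual representation are assumptions for illustration, whereas the LF-VBM itself works on an internal model and derives the iteration count from the test-case execution.

```python
def unroll_while(condition, body, iterations, indent="    "):
    """Expand `while (condition) { body }` into `iterations` nested
    if-statements (illustrative only; not the Jade model builder)."""
    lines = []
    for depth in range(iterations):
        pad = indent * depth
        lines.append(f"{pad}if ({condition}) {{")
        # Each copy of a body statement would receive its own
        # assumption variable in a diagnosis model.
        for stmt in body:
            lines.append(f"{pad}{indent}{stmt}")
    # Close the nested blocks from the innermost outwards.
    for depth in reversed(range(iterations)):
        lines.append(f"{indent * depth}}}")
    return "\n".join(lines)


# Iteration count assumed to come from running the test case once.
print(unroll_while("i < n", ["sum = sum + i;", "i = i + 1;"], iterations=2))
```

Because every unrolled copy of a statement is a separate entity, a diagnosis can single out one statement of the loop body instead of blaming the loop as a whole.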

As can be seen in Table 2, switching from the VBM to the LF-VBM leads to
much better results. In particular, the percentage of statements that
has to be considered until the fault is located is reduced to 32-43%¹ on
average, which is quite low compared to the percentage obtained with the
VBM. For the LF-VBM it is no longer the case that every faulty statement
is included in a diagnosis of cardinality one (as with the VBM).
Therefore, the cardinality up to which diagnoses have to be computed is
likely to be greater than one, depending on the type of fault and the
program structure. For most example programs the diagnosis cardinality
required to locate a fault is less than or equal to two, which is
usually computationally feasible for small- to medium-sized programs.
Another aspect of the LF-VBM that keeps the model from being blindly
applicable is that the strong fault modes of the conditional statements
decouple the selection of the conditional branch to be executed from the
evaluation of the selection condition. Therefore, faults in the
condition cannot be located using the LF-VBM. Fortunately, such faults
can in many cases be found with the VBM alone and do not require the
LF-VBM to be applied.

¹ 43% is obtained when assuming that the whole program has to be
examined for the examples where no exact solution was found. Better
estimates (37%) are obtained when taking the percentages obtained with
the VBM as upper bounds.

In the case of dependency-based models, additional tests have been
carried out to examine the overall debugging performance of the Jade
tool. As Table 1 indicates, the average number of user interactions
needed by the Jade tool is significantly smaller than the number of user
interactions needed by traditional debugging tools. On average some 40%
of user interactions can be saved using the Jade tool. In general, the
direct comparison of user interactions is problematic, since different
user interactions require different types of input from the user, which
vary in time, complexity, and the knowledge needed by the user. The
numbers given in Table 1 therefore include all user interactions
performed by the Jade system. If only variable queries, i.e., the input
of a new observation in the form of the value of a certain variable at a
given source code position, are counted, the average amount of user
interaction amounts to only 35% of the user interaction needed by
traditional debugging tools. Since, strictly speaking, all other kinds
of user interactions are not included in the reference value of
traditional debuggers, this lower value probably provides a more
accurate measurement of the debugging performance of the Jade system.

Comparison  If we compare the results obtained with the Jade system to
results obtained with other approaches to program analysis, it can be
seen that the approaches described herein are comparable and in many
cases even superior to other techniques. When comparing our approach to
slicing [Weiser, 1984], we find that dependency-based models yield
results similar to those obtained by slicing techniques. When
value-based models are used, our results are much better, because for
most of the example programs used during the evaluation of the
value-based variants, static slicing is not able to eliminate any
statement. This can be explained by the different levels of abstraction
applied by dependency-based models and slicing on the one hand and
value-based diagnosis techniques on the other hand. The value-based
approach is somewhat closer to the actual execution semantics of the
program than both program slicing and dependency-based models. Another
improvement with respect to slicing is that we can provide more
information to the user if a loop has to be executed a different number
of times to explain a fault. Those examples where no statements of the
program can be eliminated are programs that are either very short
(consisting of only an initialization statement and a loop) or programs
where almost every part of the program depends on every other part (for
example a binary search tree, where the program execution depends on the
values that were inserted previously).

5  Ongoing  Work 

Although the results presented in the previous section are already
promising, there remain topics for further research. This section
discusses possible enhancements of the models that avoid some of the
drawbacks mentioned in Section 4.

First, no single model is able to locate faults efficiently. Rather, a
combination of models has to be applied to perform efficient reasoning.
This multi-model reasoning approach is not only applicable to a single
level of abstraction, as in the case of the VBM and the LF-VBM, but can
also be applied using multiple levels of abstraction or types of models.
For example, the dependency-based models can be used to narrow the
region of interest, after which combinations of the VBM and the LF-VBM
can be applied to locate the fault exactly. Also, models dealing with
structural faults [Jackson, 1995; Wotawa, 2000] or various
special-purpose models (e.g., to locate faults in loops, selection
statements, etc.) could be incorporated in such a framework.

For this approach to be applicable, suitable strategies for deciding
under which conditions to apply certain kinds of models have to be
developed and evaluated. Based on these criteria, the most efficient
model can be selected depending on the program structure, the test cases
and the diagnoses computed so far. This approach overcomes the drawbacks
of the individual models and reduces the computational complexity of the
diagnostic process, because models are only instantiated when needed. To
select candidates for further inspection, suitable criteria for ranking
diagnoses according to their likelihood of explaining the fault have to
be developed.

As far as the fault classes that can be located with the Jade
environment are concerned, it should already have become clear that we
are interested in source code bugs that become observable as failures or
output errors and manifest themselves as logical faults in the analyzed
source code. This explicitly excludes compile-time and run-time failures
as well as faults leading to the non-termination of a program. To
discuss the fault classes handled by the Jade system, we divide the
class of analyzed faults into functional and structural faults.
Functional faults are all faults that result in a certain variable
storing an incorrect value in at least one possible evaluation trace. In
particular, these faults include the use of incorrect operators or the
specification of incorrect literals, such as integer or boolean
constants. Since these faults do not alter the structure of the program,
faults belonging to this class can generally be found with the Jade
debugging environment once they become observable through a test case
leading to an incorrect variable value.

Structural faults, on the other hand, are source code bugs that alter
the structure of the underlying program. This is the case if the
dependency graph [Ferrante et al., 1987] of the program is not
structurally equivalent to the dependency graph of the correct program.
The result of these faults is that the system description, i.e., the
model, differs from the system description obtained from the correct
program. At the moment structural faults can only be located under
certain circumstances. A discussion of different classes of structural
faults and how they are handled by the Jade tool is given in [Wieland,
2001]. In the future, special-purpose models have to be developed that
handle different kinds of structural faults. As already discussed, these
models then have to be combined with the general-purpose models
described herein to increase not only the performance of the Jade
debugger, but also the number of fault classes handled by the tool.

To aid the programmer in correcting a fault, an intelligent debugging
environment should be able to provide possible corrections for a faulty
part of a program. As described in [Stumptner and Wotawa, 1999], after a
single diagnosis has been selected for further investigation, possible
replacement expressions for the faulty expression can be inferred and
presented as corrections.

Finally, intuitive means for specifying the expected behavior of a
program have to be developed. This includes the construction of an
intuitive graphical user interface through which the user can easily
switch between different levels of abstraction, test case specification,
and other representations of the program (e.g., visualizations of heap
structures).


6  Related  Work 

This  section  briefly  summarizes  related  research  in  the  area 
of  program  debugging  and  compares  the  approaches  to  our 
work. 

Weiser's slicing approach [Weiser, 1984] is probably the most widely
known approach to improving program debugging. His approach relies on
program dependencies and tries to eliminate those parts of a program
that cannot contribute to an observed faulty program behavior. This
approach is comparable to the dependency-based models presented here.
Details on the relationship between these approaches can be found in
[Wotawa, 2001].

Shapiro [Shapiro, 1983] introduces a theoretical framework for
algorithmic program debugging and several algorithms suited to debugging
logic programs. However, the approach suffers from heavy user
interaction, which is undesirable when debugging larger programs. In
addition, the algorithms cannot locate faults inside procedures.

The application of model-based diagnosis to the software domain was
first proposed in [Console et al., 1993]. This paper introduces a way of
using MBD by removing and adding Horn clauses to Prolog programs.
Extensions of this approach were developed in [Bond, 1994].

Liver [Liver, 1994] discusses the use of a functional representation in
the debugging of software to reduce the problem of structural faults in
software, where missing statements or superfluous parts of a program are
the source of errors. The approach relies on symbolic execution of a
functional specification, which has to be provided by the user.

Hunt  [Hunt,  1998]  applies  the  idea  of  MBD  to  the  domain 
of  object-oriented  languages  by  building  models  for  programs 
written  in  Smalltalk.  The  model  used  in  this  work  is  based 
on  dependencies  between  instance  variables  and  method  calls 
that  modify  them.  In  contrast  to  our  approach,  [Hunt,  1998] 
is  limited  to  single  faults. 

MBD concepts have also been applied to VLSI design languages, in
particular VHDL [Friedrich et al., 1999]. That work describes (abstract)
models used for locating a concurrent statement, e.g., a VHDL process,
responsible for a detected misbehavior. The Jade project builds on this
work, but extends the previous approaches by modeling object-oriented
features.

Finally, Burnell and Horvitz [Burnell and Horvitz, 1995] present another
approach to program debugging that uses probability measurements to
guide diagnosis. As this approach relies on belief networks, which have
to be initialized by domain experts, it is doubtful whether it can be
applied to arbitrary programs.

7  Conclusion 

Building intelligent debugging aids for programmers is an important goal
that researchers have repeatedly attacked during the last decades.
Unfortunately, no generally applicable solution has been found so far.
In this paper we summarize the work done during the Jade project and
discuss some results obtained using the introduced model types. Besides
the results, specific advantages and disadvantages of each of the models
are discussed. Incorporating these models in a system with multi-model
reasoning capability and ranking criteria for diagnoses holds the
promise of wider applicability and even better discrimination. As our
approach clearly outperforms classical debugging techniques for many
example programs, the model-based approach can be considered a promising
technique that should be further researched to obtain a generally
applicable debugging tool.
Hybrid Diagnosis with Unknown Behavioral Modes

Michael W. Hofbaur¹ and Brian C. Williams²


Abstract.  A novel capability of discrete model-based diagnosis methods
is the ability to handle unknown modes where no assumption is made about
the behavior of one or several components of the system. This paper
incorporates this novel capability of model-based diagnosis into a
hybrid estimation scheme by calculating partial filters. The filters are
based on causal and structural analysis of the specified components and
their interconnection within the hybrid automaton model. Incorporating
unknown modes provides a robust estimation scheme that can cope, unlike
other hybrid estimation and multi-model estimation schemes, with
unmodeled situations and partial information.

1  Introduction 

Modern technology is increasingly leading to complex artifacts with high
demands on performance and availability. As a consequence,
fault-tolerant control and an underlying monitoring and diagnosis
capability play an important role in achieving these requirements.
Monitoring and diagnosis systems that build upon the discrete
model-based reasoning paradigm [8] can cope well with complexity in
modern artifacts. As an example, the Livingstone system [22]
successfully monitored and diagnosed the DS-1 space probe in flight, a
system with approximately 480 modes of operation. However, a widespread
application of discrete model-based systems is hindered by their
difficulty in reasoning about the continuous dynamics of an artifact in
a comprehensive manner. Continuous behaviors are difficult to capture
with the purely qualitative models that are used by the reasoning
engines. Nevertheless, additional reasoning in terms of the continuous
dynamics is vital for detecting functional failures, as well as
low-level incipient (i.e., slowly developing) faults and subtle
component degradation.

Hybrid systems theory provides a modeling paradigm that integrates both
continuous state evolution and discrete mode changes in a comprehensive
manner. Recent work in hybrid estimation [14, 16, 24, 9] attempts to
overcome the shortcomings of discrete model-based diagnosis cited above
and provides schemes that integrate model-based approaches with
techniques from fault detection and isolation (FDI) [23, 4] and
multi-model adaptive filtering [13, 11, 10]. The hybrid estimation
schemes, as well as their FDI and multi-model filtering ancestors, work
well whenever the underlying model(s) are 'close' mathematical
descriptions of the physical artifact. They can fail severely whenever
unforeseen situations occur. Therefore, it is essential to provide
models that capture the entire spectrum of possible behaviors/modes
whenever we use the hybrid estimate, for instance, for closed-loop
control. Model-based diagnosis, in contrast, does not impose such a
strong modeling assumption. Its concept of the unknown mode allows
diagnosis of systems where no assumption is made about the behavior of
one or several components of the system. In this way, it captures
unspecified and unforeseen behaviors of the system under investigation.
This paper provides an approach to incorporate the concept of an unknown
mode into our hybrid estimation scheme [9]. As a result we obtain an
estimation capability that can detect unforeseen situations.
Furthermore, it allows us to continue estimation on a degraded basis. We
achieve this by causal analysis [17, 20], structural analysis [7] and
decomposition of the system.

This paper starts with a brief introduction to our hybrid systems
modeling and estimation scheme. Upon this foundation, we extend hybrid
estimation to incorporate the unknown mode and demonstrate the
underlying structural analysis and decomposition task. Finally, an
experimental evaluation with computer-simulated data for a Martian life
support system demonstrates the advantages of this extended hybrid
estimation scheme.

2  Hybrid  Systems 

The  hybrid  automaton  model  used  throughout  this  paper  is  based  on 
[9]  and  can  be  seen  as  a  model  that  merges  hidden  Markov  models 
(HMM)  with  continuous  discrete-time  dynamical  system  models  (we 
present  the  model  on  the  level  of  detail  sufficient  for  this  work  and 
refer  the  reader  to  the  reference  cited  above  for  more  detail). 

2.1  Concurrent  Hybrid  Automata 

Definition 1  A discrete-time probabilistic hybrid automaton (PHA)
$\mathcal{A}$ is described as a tuple
$\langle \mathbf{x}, \mathbf{w}, F, T, \mathcal{X}_d, T_s \rangle$:

•  $\mathbf{x}$ denotes the hybrid state variables of the automaton³,
composed of $\mathbf{x} = \{x_d\} \cup \mathbf{x}_c$. The discrete
variable $x_d$ denotes the mode of the automaton and has finite domain
$\mathcal{X}_d$. The continuous state variables $\mathbf{x}_c$ capture
the dynamic evolution of the automaton. $\mathbf{x}$ denotes the hybrid
state of the automaton, while $\mathbf{x}_c$ denotes the continuous
state.

•  The set of I/O variables
$\mathbf{w} = \mathbf{u}_d \cup \mathbf{u}_c \cup \mathbf{y}_c$ of the
automaton is composed of disjoint sets of discrete input variables
$\mathbf{u}_d$ (called command variables), continuous input variables
$\mathbf{u}_c$, and continuous output variables $\mathbf{y}_c$.

•  $F : \mathcal{X}_d \rightarrow F_{DE} \cup F_{AE}$ specifies the
continuous evolution of the automaton in terms of discrete-time
difference equations $F_{DE}$ and algebraic equations $F_{AE}$ for each
mode $x_d \in \mathcal{X}_d$. $T_s$ denotes the sampling period of the
discrete-time difference equations.

•  The finite set, $T$, of transitions specifies the probabilistic
discrete evolution of the automaton.

³ When clear from context, we use lowercase bold symbols, such as
$\mathbf{v}$, to denote a set of variables $\{v_1, \ldots, v_l\}$, as
well as a vector $[v_1, \ldots, v_l]^T$ with components $v_i$.

Complex systems are modeled as a composition of concurrently operating
PHAs that represent the individual system components. A concurrent
probabilistic hybrid automaton (cPHA) specifies this composition as well
as its interconnection to the outside world:

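Definition 2  A concurrent probabilistic hybrid automaton (cPHA)
$\mathcal{CA}$ is described as a tuple
$\langle A, \mathbf{u}, \mathbf{y}_c, \mathbf{v}_s, \mathbf{v}_o, N_x, N_y \rangle$:

•  $A = \{\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_l\}$ denotes
the finite set of PHAs that represent the components $\mathcal{A}_i$ of
the cPHA (we denote the components of a PHA $\mathcal{A}_i$ by $x_{di}$,
$\mathbf{x}_{ci}$, $\mathbf{u}_{di}$, $\mathbf{u}_{ci}$,
$\mathbf{y}_{ci}$, $F_i$, $\mathcal{X}_{di}$).

•  The input variables $\mathbf{u} = \mathbf{u}_d \cup \mathbf{u}_c$ of
the automaton consist of the sets of discrete input variables
$\mathbf{u}_d = \mathbf{u}_{d1} \cup \ldots \cup \mathbf{u}_{dl}$
(command variables) and continuous input variables
$\mathbf{u}_c \subseteq \mathbf{u}_{c1} \cup \ldots \cup \mathbf{u}_{cl}$.

•  The output variables
$\mathbf{y}_c \subseteq \mathbf{y}_{c1} \cup \ldots \cup \mathbf{y}_{cl}$
specify the observed output variables of the cPHA.

•  The observation process is subject to additive, zero-mean Gaussian
sensor noise. $N_y : \mathcal{X}_d \rightarrow \mathbb{R}^{m \times m}$
specifies the mode-dependent⁴ disturbance $\mathbf{v}_o$ in terms of the
covariance matrix $R = \mathrm{diag}(r_i)$.

•  $N_x$ specifies additive, zero-mean Gaussian disturbances that act
upon the continuous state variables
$\mathbf{x}_c = \mathbf{x}_{c1} \cup \ldots \cup \mathbf{x}_{cl}$.
$N_x : \mathcal{X}_d \rightarrow \mathbb{R}^{n \times n}$ specifies the
mode-dependent disturbance $\mathbf{v}_s$ in terms of the covariance
matrix $Q$.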

Definition 3  The hybrid state $\mathbf{x}_{(k)}$ of a cPHA at time-step
$k$ specifies the mode assignment $\mathbf{x}_{d,(k)}$ of the mode
variables $\mathbf{x}_d = \{x_{d1}, \ldots, x_{dl}\}$ and the continuous
state assignment $\mathbf{x}_{c,(k)}$ of the continuous state variables
$\mathbf{x}_c = \mathbf{x}_{c1} \cup \ldots \cup \mathbf{x}_{cl}$.

Interconnection among the cPHA components $\mathcal{A}_i$ is achieved
via shared continuous I/O variables
$w_c \in \mathbf{u}_{ci} \cup \mathbf{y}_{ci}$ only. Fig. 1 illustrates
a simple example composed of three PHAs.

Figure 1.  Example cPHA composed of three PHAs

A cPHA specifies a mode-dependent discrete-time model for a plant with
command inputs $\mathbf{u}_d$, continuous inputs $\mathbf{u}_c$,
continuous outputs $\mathbf{y}_c$, mode $\mathbf{x}_d$, continuous state
variables $\mathbf{x}_c$ and additive, zero-mean Gaussian disturbances
$\mathbf{v}_s$, $\mathbf{v}_o$. The discrete-time evolution of
$\mathbf{x}_c$ and $\mathbf{y}_c$ is described by the nonlinear system
of difference equations (sampling period $T_s$)

\[
\begin{aligned}
\mathbf{x}_{c,(k)} &= \mathbf{f}_{(k)}(\mathbf{x}_{c,(k-1)}, \mathbf{u}_{c,(k-1)}) + \mathbf{v}_{s,(k-1)} \\
\mathbf{y}_{c,(k)} &= \mathbf{g}_{(k)}(\mathbf{x}_{c,(k)}, \mathbf{u}_{c,(k)}) + \mathbf{v}_{o,(k)}.
\end{aligned} \tag{1}
\]

The functions $\mathbf{f}_{(k)}$ and $\mathbf{g}_{(k)}$ are obtained by
symbolically solving⁵ the set of equations
$F_1(x_{d1,(k)}) \cup \ldots \cup F_l(x_{dl,(k)})$ given the mode
$\mathbf{x}_{d,(k)} = [x_{d1,(k)}, \ldots, x_{dl,(k)}]^T$.

Consider the illustrative cPHA in Fig. 1 with

\[
\begin{aligned}
\mathcal{A}_1 &= \langle \{x_{d1}\}, \{u_{d1}, u_{c1}, w_{c1}\}, F_1, T_1, \{m_{11}, m_{12}\}, \ldots \rangle \\
\mathcal{A}_2 &= \langle \{x_{d2}, x_{c1}\}, \{u_{d2}, w_{c1}, y_{c1}\}, F_2, T_2, \{m_{21}, m_{22}\}, \ldots \rangle \\
\mathcal{A}_3 &= \langle \{x_{d3}, x_{c2}, x_{c3}\}, \{u_{d2}, u_{c1}, y_{c1}, y_{c2}\}, F_3, T_3, \{m_{31}\}, \ldots \rangle.
\end{aligned}
\]

$F_1$, $F_2$ and $F_3$ provide for a cPHA mode
$\mathbf{x}_{d,(k)} = [m_{11}, m_{21}, m_{31}]^T$ the equations

\[
\begin{aligned}
F_1(m_{11}) &= \{ u_{c1} = 5.0\, w_{c1} \} \\
F_2(m_{21}) &= \{ x_{c1,(k)} = 0.8\, x_{c1,(k-1)} + w_{c1,(k-1)},\; y_{c1} = x_{c1} \} \\
F_3(m_{31}) &= \{ x_{c2,(k)} = x_{c3,(k-1)} + y_{c1,(k-1)},\; x_{c3,(k)} = 0.4\, x_{c2,(k-1)} + 0.5\, u_{c1,(k-1)},\; y_{c2} = 2.0\, x_{c2} + x_{c3} \}.
\end{aligned} \tag{2}
\]

This leads to the discrete-time model:

\[
\begin{aligned}
x_{c1,(k)} &= 0.8\, x_{c1,(k-1)} + 0.2\, u_{c1,(k-1)} + v_{s1,(k-1)} \\
x_{c2,(k)} &= x_{c1,(k-1)} + x_{c3,(k-1)} + v_{s2,(k-1)} \\
x_{c3,(k)} &= 0.4\, x_{c2,(k-1)} + 0.5\, u_{c1,(k-1)} + v_{s3,(k-1)} \\
y_{c1,(k)} &= x_{c1,(k)} + v_{o1,(k)} \\
y_{c2,(k)} &= 2.0\, x_{c2,(k)} + x_{c3,(k)} + v_{o2,(k)}
\end{aligned} \tag{3}
\]

⁴ E.g., sensors can experience different magnitudes of disturbances for
different modes.

⁵ Our symbolic solver restricts the algebraic equations and nonlinear
functions to ones that can be solved explicitly and utilizes a Gröbner
basis approach [3] to derive a set of equations of form (1).

2.2  Estimation of Hybrid Systems

To detect the onset of subtle failures, it is essential that a
monitoring and diagnosis system is able to accurately extract the hybrid
state of a system from a signal that may be hidden among disturbances,
such as measurement noise. This is the role of a hybrid observer. More
precisely:

Hybrid Estimation Problem: Given a cPHA $\mathcal{CA}$, a sequence of
observations
$\{\mathbf{y}_{c,(0)}, \mathbf{y}_{c,(1)}, \ldots, \mathbf{y}_{c,(k)}\}$
and control inputs
$\{\mathbf{u}_{(0)}, \mathbf{u}_{(1)}, \ldots, \mathbf{u}_{(k)}\}$,
estimate the most likely hybrid state $\hat{\mathbf{x}}_{(k)}$ at
time-step $k$.

A hybrid state estimate $\hat{\mathbf{x}}_{(k)}$ consists of a
continuous state estimate, together with the associated mode. We denote
this by the tuple

\[
\hat{\mathbf{x}}_{(k)} := \langle \mathbf{x}_{d,(k)}, \hat{\mathbf{x}}_{c,(k)}, \Sigma_{(k)} \rangle,
\]

where $\hat{\mathbf{x}}_{c,(k)}$ specifies the mean and $\Sigma_{(k)}$
the covariance for the continuous state variables $\mathbf{x}_c$. The
likelihood of an estimate $\hat{\mathbf{x}}_{(k)}$ is denoted by the
hybrid belief-state $h_{(k)}[\hat{\mathbf{x}}]$.
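
To make the estimation setting concrete, the example model (3) can be simulated to produce the kind of noisy observation sequence a hybrid observer receives for one fixed mode. The sketch below is illustrative only: the coefficients are those read from model (3), while the function name, the constant input, the zero initial state and the noise standard deviations `q_std` and `r_std` are assumptions chosen here.

```python
import random


def simulate(u_c1, steps, q_std=0.01, r_std=0.05, seed=0):
    """Simulate the example model (3) for a constant input u_c1.
    Noise levels q_std / r_std are assumed values for illustration."""
    rng = random.Random(seed)
    x1 = x2 = x3 = 0.0  # continuous state x_c1, x_c2, x_c3 (assumed start)
    observations = []
    for _ in range(steps):
        # State update; the right-hand side uses the previous state only.
        x1, x2, x3 = (
            0.8 * x1 + 0.2 * u_c1 + rng.gauss(0.0, q_std),
            x1 + x3 + rng.gauss(0.0, q_std),
            0.4 * x2 + 0.5 * u_c1 + rng.gauss(0.0, q_std),
        )
        # Noisy outputs y_c1 and y_c2.
        y1 = x1 + rng.gauss(0.0, r_std)
        y2 = 2.0 * x2 + x3 + rng.gauss(0.0, r_std)
        observations.append((y1, y2))
    return observations


for y1, y2 in simulate(u_c1=1.0, steps=5):
    print(f"y_c1 = {y1:.3f}, y_c2 = {y2:.3f}")
```

With both noise levels set to zero the outputs settle at a fixed point of the difference equations, which gives a quick sanity check of the model before any estimator is attached.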

We perform hybrid estimation as an extended version of HMM-style
belief-state update that accounts for the influence of the continuous
dynamics upon the system's discrete modes. A major difference between
hybrid estimation and an HMM-style belief-state update, as well as
multi-model estimation, is, however, that hybrid estimation tracks a set
of trajectories, whereas standard belief-state update and multi-model
estimation aggregate trajectories that share the same mode. This
difference is reflected in the first of the following two recursive
functions, which define our hybrid estimation scheme:

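$$h_{(\bullet k)}[\hat{\mathbf{x}}_i] = P_T(m_i \mid \hat{\mathbf{x}}_{j,(k-1)}, \mathbf{u}_{d,(k-1)})\, h_{(k-1)}[\hat{\mathbf{x}}_j] \tag{4}$$

$$h_{(k)}[\hat{\mathbf{x}}_i] = \frac{h_{(\bullet k)}[\hat{\mathbf{x}}_i]\, P_O(\mathbf{y}_{c,(k)} \mid \hat{\mathbf{x}}_{i,(k)}, \mathbf{u}_{c,(k)})}{\sum_j h_{(\bullet k)}[\hat{\mathbf{x}}_j]\, P_O(\mathbf{y}_{c,(k)} \mid \hat{\mathbf{x}}_{j,(k)}, \mathbf{u}_{c,(k)})} \tag{5}$$

$h_{(\bullet k)}[\hat{\mathbf{x}}_i]$ denotes an intermediate hybrid belief-state, based on transition probabilities only. Hybrid estimation determines for each $\hat{\mathbf{x}}_{j,(k-1)}$ at the previous time-step $k-1$ the possible transitions, thus specifying candidate successor states to be tracked. Consecutive filtering provides the new hybrid state $\hat{\mathbf{x}}_{i,(k)}$ and adjusts the hybrid belief-state $h_{(k)}[\hat{\mathbf{x}}_i]$ based on the hybrid probabilistic observation function $P_O(\mathbf{y}_{c,(k)} \mid \hat{\mathbf{x}}_{i,(k)}, \mathbf{u}_{c,(k)})$. The estimate $\hat{\mathbf{x}}_{j,(k)}$ with the highest belief-state $h_{(k)}[\hat{\mathbf{x}}_j] = \max_i (h_{(k)}[\hat{\mathbf{x}}_i])$ is taken as the hybrid estimate at time-step $k$.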
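The two recursive functions (4) and (5) can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions: the function name `hybrid_belief_update`, the estimate identifiers and all probabilities are hypothetical, and the observation values are simply given rather than produced by a filter.

```python
def hybrid_belief_update(prior, successors):
    """One belief-state update following (4) and (5).

    prior      -- dict: parent estimate id j -> h_(k-1)[x_j]
    successors -- list of (successor id i, parent id j, P_T, P_O)
    returns    -- dict: successor id i -> h_(k)[x_i]
    """
    weighted = {}
    for succ_id, parent_id, p_trans, p_obs in successors:
        intermediate = p_trans * prior[parent_id]   # equation (4)
        weighted[succ_id] = intermediate * p_obs    # numerator of (5)
    total = sum(weighted.values())                  # denominator of (5)
    if total <= 0.0:
        raise ValueError("all candidate successors have zero likelihood")
    return {succ_id: w / total for succ_id, w in weighted.items()}


# Hypothetical numbers: two tracked estimates, three candidate successors.
prior = {"j1": 0.7, "j2": 0.3}
successors = [
    ("j1->nominal", "j1", 0.99, 0.8),
    ("j1->fault", "j1", 0.01, 0.1),
    ("j2->fault", "j2", 1.00, 0.1),
]
belief = hybrid_belief_update(prior, successors)
best = max(belief, key=belief.get)  # estimate with the highest belief-state
```

Each successor keeps its own parent, so trajectories that end in the same mode (here the two fault successors) are tracked separately instead of being aggregated.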

Tracking  all  possible  trajectories  of  the  system  is  almost  always 
intractable  because  the  number  of  trajectories  becomes  too  large  after 
only a few time-steps. In [9] we present an approximate anytime,
anyspace algorithm that copes with the exponential growth, as well as
the  large  number  of  modes  in  a  typical  concurrent  hybrid  automaton 
model. 

Hybrid  estimation  and  other  multi-model  estimation  schemes  have 
in  common  that  they  require  models  that  are  ’close’  mathematical  de¬ 
scriptions  of  the  system.  They  can  fail  severely  whenever  unforeseen, 
i.e.  unmodeled,  situations  occur.  As  a  consequence,  we  have  to  pro¬ 
vide  models  for  all  operational  modes  as  well  as  an  exhaustive  set 
of  models  for  possible  failure  modes.  Providing  all  possible  failure 
models  can  be  problematic  even  under  the  assumption  of  an  exhaus¬ 
tive  failure  mode  effect  analysis  (FMEA).  For  instance,  consider  an 
incipient  fault  in  a  servo  valve  that  causes  the  valve  to  drift  off  its 
nominal  opening  value.  The  drift  (positive,  negative,  slow,  fast...)  is 
subject  to  the  fault.  It  is  surely  difficult  to  provide  a  mathematical 
model  with  the  correct  parameter  values  that  captures  all  possible 
drift  situations.  Nor  is  it  helpful  to  introduce  a  sufficiently  large  set 
of  modes  that  captures  possible  situations  of  the  drift  fault  as  this 
would  introduce  additional  complexity  for  hybrid  estimation  by  in¬ 
creasing  the  number  of  modes  unnecessarily. 

This requirement of hybrid mode estimation is in contrast to discrete model-based diagnosis schemes, such as GDE (e.g. [5, 6, 19]). Model-based diagnosis deduces the possible mode of the system based on nominal models and only a few specified fault models. The onset of possible fault scenarios is covered by the so-called unknown mode, which does not impose any constraints on the system's variables.

The  next  section  provides  an  approach  that  systematically  incor¬ 
porates  the  concept  of  the  unknown  mode  into  our  hybrid  estimation 
scheme. 

3  Estimation  with  Unknown  Modes 

The estimation scheme [9] requires a fully specified mode assignment $\mathbf{x}_{d,i,(k)}$ for each candidate trajectory that is tracked in the course of hybrid estimation. Only a fully specified mode allows us to deduce the mathematical model (1) for the overall system. This model is the basis for the dynamic filter (e.g. extended Kalman filter) that is used in the course of hybrid estimation.


Figure  2.  MIMO  filter  (e.g.  extended  Kalman  filter)  for  the  cPHA  example 

For our illustrative 3 component example introduced above this would mean that hybrid estimation calculates a multi-input multi-output (MIMO) filter (see Fig. 2) for mode $\mathbf{x}_{d,i,(k)} = [m_{11}, m_{21}, m_{31}]^T$ based on the mathematical model (3). This filter provides the hybrid state estimate $\hat{\mathbf{x}}_{i,(k)}$ as well as the value for the hybrid probabilistic observation function $P_O(\mathbf{y}_{c,(k)} \mid \hat{\mathbf{x}}_{i,(k)}, \mathbf{u}_{c,(k)})$ for the hybrid estimator (see Appendix A for the extended Kalman filter estimation details).

Let us assume the mode $\mathbf{x}_{d,i,(k)} = [?, m_{21}, m_{31}]^T$ which specifies that component 1 ($\mathcal{A}_1$) is in unknown mode. A component in unknown mode imposes no constraints (equations) among its variables ($u_{c1}$ and the internal variable $w_{c1}$, in our case). As a consequence, we cannot deduce an overall mathematical model of the form (1) and fail to provide the basis for the hybrid estimation scheme, the MIMO filter for mode $\mathbf{x}_{d,i,(k)} = [?, m_{21}, m_{31}]^T$.


Figure  3.  Example  cPHA  with  explicit  noise  inputs 


However, a close look at the PHA interconnection (Fig. 3; the figure extends Fig. 1 by including the implicit noise inputs, as well as indicating the causality for the internal I/O variables) reveals that we can still estimate component 3 by its observed output $y_{c2}$ and the observation $y_{c1}$ as a substitute for the value of its input. This intuitive approach utilizes a decomposition of the cPHA as shown in Fig. 4.


Figure  4.  Decomposed  cPHA 


The decomposition allows us to treat the concurrent parts of the system independently and calculate a filter cluster consisting of 2 independent filters. However, when calculating the individual filters for the cluster, we have to take into account that we use the measurement of the input to the third component ($y_{c1}$) in place of its true value. This can be interpreted as having additional additive noise at the component's input as indicated in Fig. 4. The following modification of the covariance matrix $\mathbf{Q}_3$ for the state variables of $\mathcal{A}_3$ takes this into account:

$$\tilde{\mathbf{Q}}_3 = \mathbf{b}_3 r_1 \mathbf{b}_3^T + \mathbf{Q}_3, \tag{6}$$

where $r_1$ denotes the variance of disturbance $v_{o1}$ and $\mathbf{b}_3 = [0, 1]^T$ denotes the input vector$^6$ of $\mathcal{A}_3$ with respect to $y_{c1}$.

Figure 5. Decomposed filter
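A small numerical illustration of the covariance modification (6) may help. In the sketch below only the input vector $[0, 1]^T$ comes from the text; the helper name `add_input_noise` and the values chosen for the covariance matrix and the variance are assumptions made purely for the example.

```python
def add_input_noise(Q, b, r):
    """Return b * r * b^T + Q, cf. (6).

    Q -- n x n process noise covariance (list of lists)
    b -- input vector of length n
    r -- variance of the measurement that replaces the true input
    """
    n = len(b)
    return [[Q[i][j] + b[i] * r * b[j] for j in range(n)] for i in range(n)]


# Hypothetical values; only b3 = [0, 1]^T is taken from the text.
Q3 = [[0.01, 0.0], [0.0, 0.02]]
b3 = [0.0, 1.0]
r1 = 0.05
Q3_modified = add_input_noise(Q3, b3, r1)
```

With these numbers only the second diagonal entry grows (from 0.02 to about 0.07), which reflects that the measurement noise enters through the state variable driven by the observed input.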

A filter cluster consisting of extended Kalman filters and the MIMO extended Kalman filter are interchangeable as they provide the same expected value for the continuous state ($E(\mathbf{x}_c)$) whenever the mode of the automaton is fully specified. However, the decomposed filter has the advantage that the probabilistic observation function $P_O$ of the overall system is given by

$$P_O = \prod_j P_{Oj}, \tag{7}$$

where $P_{Oj}$ denotes the probabilistic observation function of the $j$-th filter in the filter cluster.

This factorization of the probabilistic observation function allows us to calculate an upper bound for $P_O$ whenever one or more components of the system are in unknown mode. We simply take the product over the remaining filters in the cluster. This is equivalent to considering the upper bounds of the inequalities $P_{Oj} \le 1$ for each unknown filter $j$. In our example with unknown component $\mathcal{A}_1$ this would mean:

$$P_O \le P_{O2},$$

where $P_{O2}$ denotes the observation function for the filter that estimates the continuous state of component $\mathcal{A}_3$.
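The bound can be sketched as a product that skips the filters of unknown-mode components. The function name `observation_upper_bound`, the filter labels and the numeric value are hypothetical; `None` is used here, by assumption, to mark a filter that cannot be evaluated.

```python
import math


def observation_upper_bound(cluster):
    """Upper bound for P_O: product over the filters that can be evaluated.

    cluster -- dict: filter name -> P_Oj, or None for a filter that
               belongs to a component in unknown mode (bounded by 1)
    """
    return math.prod(p for p in cluster.values() if p is not None)


# Filter 1 is affected by the unknown component, filter 2 is still available.
bound = observation_upper_bound({"filter_1": None, "filter_2": 0.6})
```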

The following subsection provides a graph-based approach for filter cluster deduction that grounds the informally introduced decomposition on a more versatile basis.

3.1  System  Decomposition  and  Filter  Cluster 
Calculation 

Starting point for the decomposition of the system for a cPHA mode $\mathbf{x}_d$ is the set of equations

$$F_1(x_{d1,(k)}) \cup \ldots \cup F_l(x_{dl,(k)}) =: \mathcal{F}(\mathbf{x}_d), \tag{8}$$

where $F_j(x_{dj,(k)})$ returns the appropriate set of equations for a component $\mathcal{A}_j$ whenever $x_{dj,(k)} \in \mathcal{X}_{dj}$ or the empty set whenever the component is in unknown mode, i.e. $x_{dj,(k)} = ?$. Although we still have to solve the set of equations to arrive at the mathematical model of form (1) we can interpret the set of equations (8) as the raw model for the system given mode $\mathbf{x}_d$. The following decomposition performs a structural analysis of the raw model based on causal analysis [17, 20], structural observability analysis [7] and graph decomposition [1].

6 In the general case, we have to calculate $\mathbf{b}_j$ for a cPHA component $\mathcal{A}_j$ and observed inputs $\mathbf{u}_{yc}$ by linearization, more specifically: $\mathbf{b}_j = \partial \mathbf{f}_j / \partial \mathbf{u}_{yc} \,|_{\hat{\mathbf{x}}_{cj,(k-1)}, \mathbf{u}_{cj,(k-1)}}$, where $\mathbf{f}_j$ denotes the right-hand side of the difference equation for component $\mathcal{A}_j$, $\mathbf{u}_{yc}$ refers to the observed variables that are used as inputs to the component (i.e. $\mathbf{u}_{yc} \subset \mathbf{y}_c$) and $\hat{\mathbf{x}}_{cj,(k-1)}$ as well as $\mathbf{u}_{cj,(k-1)}$ represent the state estimate and the continuous input for component $\mathcal{A}_j$ at the previous time-step, respectively.

A cPHA model does not impose a fixed causal structure that specifies directionality of automaton interconnections. Causality is implicitly specified by the set of equations. This increases the expressiveness of the modeling framework but requires us to perform a causal analysis of the raw model (8) as a first step. The deduction of the causal dependencies is done by applying the bipartite-matching based algorithm presented in [17]. The resulting directed graph records the causal dependencies among the variables of the system (Fig. 6 shows the graph for the illustrative 3 PHA example). Each vertex of the graph represents one equation $e \in \mathcal{F}$ or an exogenous variable specification (e.g. $u_{c1}$) and is labeled by its dependent variable which also specifies the outgoing edge (in the following, we will use the variable name to refer to the corresponding vertex in the graph). Vertices without incoming edges specify the exogenous variables.

Figure 6. Causal graph for the cPHA example

Definition 4 A causal graph of a cPHA $\mathcal{CA}$ at a mode $\mathbf{x}_d$ is a directed graph that records the causal dependencies among the variables $v \in \bigcup_i \mathbf{x}_{ci} \cup \mathbf{u}_{ci} \cup \mathbf{y}_{ci}$ of $\mathcal{CA}$. We denote the causal graph by $\mathcal{CG}(\mathcal{CA}, \mathbf{x}_d)$ and sometimes omit arguments where no confusion seems likely.

The goal of our analysis is to obtain a set of independent subsystems that utilize observed variables as virtual inputs. Therefore, we slice the graph at observed variable vertices with outgoing edges, insert a new vertex to represent a virtual input and re-map the sliced outgoing edges to this vertex. Fig. 7 demonstrates this re-mapping for the causal graph of Fig. 6. The observed variables are $y_{c1}$ and $y_{c2}$. Only the vertex with dependent variable $y_{c1}$ has an outgoing edge, thus we slice the graph at $y_{c1} \to x_{c2}$ and re-map the edge to the virtual input $u_{yc1}$.
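The slicing step can be sketched on an edge list. The graph below is a hypothetical chain loosely modeled on the example (the actual graph of Fig. 6 is not reproduced here), and the function name `remap_causal_graph` as well as the naming rule for virtual inputs (prefix `u`) are assumptions of this sketch.

```python
def remap_causal_graph(edges, observed):
    """Cut all edges that leave an observed vertex and re-attach them
    to a new virtual input vertex.

    edges    -- set of (source, destination) variable names
    observed -- set of observed variable names
    returns  -- (remapped edge set, set of virtual input names)
    """
    remapped, virtual_inputs = set(), set()
    for src, dst in edges:
        if src in observed:
            virtual = "u" + src          # e.g. yc1 -> uyc1
            virtual_inputs.add(virtual)
            remapped.add((virtual, dst))
        else:
            remapped.add((src, dst))
    return remapped, virtual_inputs


# Hypothetical causal graph: uc1 -> wc1 -> xc1 -> yc1 -> xc2 -> yc2
edges = {("uc1", "wc1"), ("wc1", "xc1"), ("xc1", "yc1"),
         ("yc1", "xc2"), ("xc2", "yc2")}
remapped, virtual_inputs = remap_causal_graph(edges, {"yc1", "yc2"})
```

After the re-mapping the observed vertex `yc1` has no outgoing edge any more, so the graph falls apart into two parts that can be analyzed independently.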


Figure  7.  Remapped  causal  graph  for  the  cPHA  example 


A dynamic filter (e.g. extended Kalman filter) can only estimate the observable part of the model. Therefore, it is essential to perform an observability analysis prior to calculating the filter so that non-observable parts of the model are excluded. We perform this analysis on a structural basis$^7$.

Definition 5 We call a variable $v$ of a cPHA $\mathcal{CA}$ at mode $\mathbf{x}_d$ structurally observable (SO) whenever it is directly observed, i.e. $v \in \mathbf{y}_c$, or there exists at least one path in the causal graph $\mathcal{CG}(\mathcal{CA}, \mathbf{x}_d)$ that connects the variable $v$ to an output variable $y_c \in \mathbf{y}_c$ of $\mathcal{CA}$.

A filter estimates the state variables $\mathbf{x}_c$ of a dynamic system based on observations $\mathbf{y}_c$ and the inputs $\mathbf{u}_c$ that act upon the state variables $\mathbf{x}_c$. The required knowledge about the inputs $\mathbf{u}_c$ indicates that the structural observability criterion is not yet sufficient to determine the submodel for estimation. We have to make sure that no unknown exogenous input influences a variable. To illustrate this, consider again the 3 PHA example with mode $\mathbf{x}_d = [?, m_{21}, m_{31}]^T$. Component 1 in unknown mode omits the equation that relates the variables $u_{c1}$ and $w_{c1}$. This leads to a causal graph $\mathcal{CG}$ (Fig. 8), where $w_{c1}$ is labeled as exogenous (no incoming edges). This unknown exogenous input influences the state variable $x_{c1}$ and, as a consequence, prevents us from estimating it!


Figure 8. Remapped causal graph for the cPHA example with unknown component $\mathcal{A}_1$

We extend our structural analysis of the causal graph by the following criterion:

Definition 6 We call a variable $v$ of a cPHA $\mathcal{CA}$ at mode $\mathbf{x}_d$ structurally determined (SD) whenever it is an input variable of the automaton, i.e. $v \in \mathbf{u}_c$, or there does not exist a path in the causal graph $\mathcal{CG}(\mathcal{CA}, \mathbf{x}_d)$ that connects an exogenous variable $u_e \notin \mathbf{u}_c$ with $v$.
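Definitions 5 and 6 amount to two reachability tests on the causal graph. The sketch below applies them to a hypothetical remapped graph in which component 1 is in unknown mode; the graph, the function name `structural_labels` and the decision to treat an unknown exogenous variable itself as undetermined are assumptions of this illustration.

```python
def structural_labels(edges, observed, known_inputs):
    """Label every vertex with (SO, SD) following Definitions 5 and 6."""
    succ = {}
    for src, dst in edges:
        succ.setdefault(src, set()).add(dst)
        succ.setdefault(dst, set())

    def descendants(start):
        # all vertices reachable from start via at least one edge
        seen, stack = set(), [start]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    has_incoming = {dst for _, dst in edges}
    unknown_exogenous = {v for v in succ
                         if v not in has_incoming and v not in known_inputs}
    # assumption: an unknown exogenous variable is itself undetermined
    undetermined = set(unknown_exogenous)
    for u_e in unknown_exogenous:
        undetermined |= descendants(u_e)

    labels = {}
    for v in succ:
        so = v in observed or bool(descendants(v) & observed)
        sd = v in known_inputs or v not in undetermined
        labels[v] = (so, sd)
    return labels


# Hypothetical remapped graph; the equation relating uc1 and wc1 is missing.
edges = {("wc1", "xc1"), ("xc1", "yc1"), ("uyc1", "xc2"), ("xc2", "yc2")}
labels = structural_labels(edges, observed={"yc1", "yc2"},
                           known_inputs={"uc1", "uyc1"})
```

In this toy graph `xc1` is structurally observable but not determined, whereas `xc2` satisfies both criteria and can therefore be estimated.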

Furthermore, it is helpful to eliminate loops in the causal graph prior to checking variables against both structural criteria. For this purpose, we calculate the strongly connected components of the causal graph [1].

Definition 7 A strongly connected component (SCC) of the causal graph $\mathcal{CG}$ is a maximal set $SCC$ of variables in which there is a path from any one variable in the set to any other variable in the set.

Fig.  9  shows  the  remapped  causal  graph  for  the  3  PHA  example  after 
grouping  variables  into  strongly  connected  components. 
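Strongly connected components can be computed with any standard algorithm. The sketch below uses Kosaraju's two-pass scheme, chosen here for brevity and not necessarily the algorithm of [1]; the small graph with a feedback loop between `x1` and `x2` is hypothetical.

```python
def strongly_connected_components(succ):
    """Kosaraju's algorithm.

    succ -- dict: vertex -> list of successors (every vertex is a key)
    returns a list of frozensets, one per strongly connected component
    """
    order, seen = [], set()

    def visit(v):                      # first pass: record finishing order
        seen.add(v)
        for w in succ[v]:
            if w not in seen:
                visit(w)
        order.append(v)

    for v in succ:
        if v not in seen:
            visit(v)

    pred = {v: [] for v in succ}       # transposed graph
    for v, ws in succ.items():
        for w in ws:
            pred[w].append(v)

    components, assigned = [], set()
    for v in reversed(order):          # second pass on the transposed graph
        if v in assigned:
            continue
        component, stack = set(), [v]
        assigned.add(v)
        while stack:
            x = stack.pop()
            component.add(x)
            for p in pred[x]:
                if p not in assigned:
                    assigned.add(p)
                    stack.append(p)
        components.append(frozenset(component))
    return components


# Hypothetical graph: u -> x1 <-> x2 -> y
graph = {"u": ["x1"], "x1": ["x2"], "x2": ["x1", "y"], "y": []}
sccs = strongly_connected_components(graph)
```

Collapsing each component into a single vertex yields a loop-free graph on which the structural criteria can be checked.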

Figure 9. Causal SCC graph for cPHA example

7 Throughout the paper we assume that loss of observability is caused by a structural defect of the model. Otherwise, it is necessary to perform an additional numerical observability test [18] as structural observability only provides a necessary condition for observability.

The strong interconnection among variables in an SCC implies that:

1. Structural observability of variables in an SCC follows directly from structural observability of at least one variable in the SCC.

2. A variable in an SCC is structurally determined, if and only if all variables in the SCC are structurally determined.

As a consequence, we can apply our structural analysis to strongly connected components directly and operate on the SCC graph, i.e. a causal graph without loops. The analysis of a strongly connected component with respect to structural observability and structural determination (SOD) can be outlined as follows:

function determine-SOD-of-SCC(SCC, u_c, k)
  when SOD-undetermined?(SCC)
    if exogenous?(SCC)
      then v_i <- independent-var(SCC)
           if v_i ∈ u_c then SD(SCC) <- True
                        else SD(SCC) <- False
      else V <- uplink-SCCs(SCC)
           loop for SCC_i in V
             do determine-SOD-of-SCC(SCC_i, u_c, k)
           SO(SCC) <- True
           SD(SCC) <- all-uplink-SCCs-are-SD?(V)
           cluster-index(SCC) <- k ∪ cluster-indices(V)
    SOD-determined(SCC) <- True
  return Nil

Our  structural  analysis  algorithm  determines  structural  observabil¬ 
ity  and  determination  (SOD)  of  a  variable  by  traversing  the  SCC 
graph  backwards  from  the  observed  variables  towards  the  inputs. 
In  the  course  of  this  analysis  we  label  non-exogenous  strongly  con¬ 
nected  components  with  an  index  that  refers  to  their  cluster  mem¬ 
bership.  This  indexing  scheme  allows  us  to  cluster  the  variables  into 
non-overlapping  clusters  with  respect  to  the  observed  variables.  The 
direct  relation  between  a  variable,  its  determining  equation,  and  the 
cPHA  component  that  specified  this  equation  leads  to  the  compo¬ 
nent  clusters  sought.  The  structural  analysis  can  be  summarized  as 
follows: 

function component-clustering(CA, x_d)
  returns a set of cPHA component clusters
  y_c <- observed-vars(CA)
  CG <- remap-causal-graph(CG(CA, x_d), y_c)
  u_c <- virtual-inputs(CG) ∪ input-vars(CA)
  CG_SCC <- strongly-connected-component-graph(CG)
  k <- 0
  loop for SCC_i in output-SCCs(CG_SCC, y_c)
    do determine-SOD-of-SCC(SCC_i, u_c, k)
       k <- k + 1
  graph-clusters <- get-SOD-SCC-clusters(CG_SCC)
  return automaton-clusters(CA, graph-clusters)
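The two functions above can be approximated in Python as follows. This is a sketch under several assumptions, not the authors' implementation: the SCC graph is given directly as a map from each SCC to its uplink SCCs, exogenous SCCs are identified by having no uplinks and carry the name of their variable, and clusters are formed by merging structurally observable and determined SCCs whose index sets overlap (the text does not spell out this last step). The example graph is a hypothetical rendering of the BIO-Plex structure described later.

```python
def determine_sod(scc, uplinks, known_inputs, k, labels):
    """Recursive SOD labeling of one SCC (graph assumed loop-free)."""
    if scc in labels:                       # SOD already determined
        return
    ups = uplinks.get(scc, [])
    if not ups:                             # exogenous SCC
        labels[scc] = {"SO": False, "SD": scc in known_inputs, "index": set()}
        return
    for up in ups:
        determine_sod(up, uplinks, known_inputs, k, labels)
    index = {k}
    for up in ups:
        index |= labels[up]["index"]
    labels[scc] = {"SO": True,
                   "SD": all(labels[up]["SD"] for up in ups),
                   "index": index}


def component_clusters(uplinks, known_inputs, output_sccs):
    labels = {}
    for k, scc in enumerate(output_sccs):
        determine_sod(scc, uplinks, known_inputs, k, labels)
    clusters = []                           # list of (index set, members)
    for scc, lab in labels.items():
        if not (lab["SO"] and lab["SD"]):
            continue                        # excluded from estimation
        index, members = set(lab["index"]), {scc}
        rest = []
        for c_index, c_members in clusters:
            if c_index & index:             # overlapping indices: merge
                index |= c_index
                members |= c_members
            else:
                rest.append((c_index, c_members))
        clusters = rest + [(index, members)]
    return [members for _, members in clusters]


# Hypothetical SCC graph: two flow regulators and the chamber.
uplinks = {"xc1": ["uc1"], "yc1": ["xc1"],
           "xc2": ["uc1"], "yc2": ["xc2"],
           "xc3": ["uyc1", "uyc2", "wc1", "wc2", "wc3"], "yc3": ["xc3"]}
inputs = {"uc1", "uyc1", "uyc2", "wc1", "wc2", "wc3"}
clusters = component_clusters(uplinks, inputs, ["yc1", "yc2", "yc3"])
```

With all inputs known the sketch returns three clusters; removing `wc3` from the known inputs leaves the chamber cluster undetermined, so only the two flow regulator clusters remain.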


Figure  10.  Labeled  and  partitioned  causal  SCC  graph  for  the  3  cPHA 
example 

Each  component  cluster  defines  the  observable  and  determined 
raw  model  for  a  subsystem  of  the  cPHA.  This  raw  model  can  be 
solved  symbolically  and  provides  the  nonlinear  system  of  difference 
equations (a model similar to (1), but with the additional virtual inputs) that is the basis for the corresponding filter in the filter cluster.
In  this  way  we  exclude  the  unobservable  and/or  undetermined  parts 
of  the  overall  system  from  estimation. 

Whenever a state variable $x_{cj}$ becomes unobservable and/or undetermined (e.g. due to a mode change) during hybrid estimation, we hold the value for the mean at its last known estimate $\hat{x}_{cj}$ and increase its variance $\sigma_j^2 = P_{jj}$ by a constant factor at each hybrid estimation step. This reflects a continuously decreasing confidence in the estimate $\hat{x}_{cj}$ and allows us to restart estimation whenever the variable becomes observable and determined again$^8$.
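The hold-and-inflate rule is simple enough to state in code. The function name `hold_estimate`, the inflation factor of 1.1 and the starting values are arbitrary assumptions; the text only requires a constant factor.

```python
def hold_estimate(mean, variance, factor=1.1):
    """Keep the last known mean and grow the variance by a constant factor."""
    if factor <= 1.0:
        raise ValueError("the variance has to grow, so factor must exceed 1")
    return mean, variance * factor


# Hypothetical state variable that stays unobservable for five steps.
mean, variance = 1200.0, 4.0
for _ in range(5):
    mean, variance = hold_estimate(mean, variance)
```

After five steps the mean is unchanged while the variance has grown to 4.0 * 1.1^5, roughly 6.44, so a restarted filter initially trusts new measurements more than the stale estimate.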

4 Example - BIO-Plex

Our application is the BIO-Plex Test Complex at NASA Johnson Space Center, a five chamber facility for evaluating biological and physicochemical Martian life support technologies. It is an artificial, biosphere-type, closed environment, which must robustly provide all the air, water, and most of the food for a crew of four without interruption. Plants are grown in plant growth chambers, where they provide food for the crew, and convert the exhaled CO2 into O2. In order to maintain a closed-loop system, it is necessary to control the resource exchange between the chambers without endangering the crew. For the scope of this paper, we restrict our evaluation to the sub-system dealing with CO2 control in the plant growth chamber (PGC), shown in Fig. 11.

The system is composed of several components, such as redundant flow regulators (FR1, FR2) that provide continuous CO2 supply, redundant pulse injection valves (PIV1, PIV2) that provide a means for increasing the CO2 concentration rapidly, a lighting system (LS) and the plant growth chamber (PGC) itself. The control system maintains a plant growth optimal CO2 concentration of 1200 ppm during the day phase of the system (20 hours/day).

Hybrid estimation schemes are key to tracking system operational modes, as well as detecting subtle failures and performing diagnoses. For example, we simulate a failure of the second flow regulator. The regulator goes off-line and drifts slowly towards its positive limit. This fault situation is difficult to capture by an explicit fault model as we do not know, in advance, whether the regulator drifts towards its positive or negative limit, nor do we know the magnitude of the drift. A fault of this type, which develops slowly and whose symptom is hidden among the noise in the system, is a typical candidate for our unknown-mode detection capability. However, we also provide explicit failure models that describe typical situations. For example, the PGC has 4 plant trays with one illumination bank for each tray. A blackout of one illumination bank can be interpreted as a 25% loss in light intensity. This situation can be modeled explicitly by a dynamical model that takes this reduced light intensity into account.

8 Whenever a state variable $x_{cj}$ is directly observed we can also utilize an alternative approach suggested in [15] that restarts the estimator with the observed value, thus improving the observer convergence time.

Figure 11. BIO-Plex plant growth chamber

In  the  following  we  describe  the  outcome  of  a  simulated  experi¬ 
ment  where  the  flow  regulator  fault  with  drifting  symptom  is  injected 
at  time  point  k  =  700  and  an  additional  light  fault,  that  harms  one 
of the four illumination banks, is injected at k = 900. The faults are
"repaired" at k = 1100 and k = 1300 for the flow regulator fault and
the  lighting  fault,  respectively.  This  experiment  illustrates  unknown 
mode  detection  and  recovery  from  it,  nominal  failure  mode  detection, 
and  the  multiple  fault  detection  capability  of  our  approach. 


Figure 12. BIO-Plex cPHA model

The simulated data is gathered from the execution of a refined subset of NASA JSC's CONFIG model for the BIO-Plex system [12]. Hybrid estimation utilizes a cPHA model that consists of 6 components as shown in Fig. 12. To illustrate the complexity of the hybrid estimation problem we should note that the concurrent automaton has approximately $5^6 \approx 15000$ modes. Each mode describes the dynamic evolution of the chamber system by a third order system of difference equations. For example, the nominal operational condition for plant growth is characterized by the mode

$\mathbf{x}_d = [m_{r2}, m_{r2}, m_{v1}, m_{v1}, m_{l2}, m_{p2}]$, where $m_{r2}$ characterizes a partially open flow regulator, $m_{v1}$ a closed pulse injection valve, $m_{l2}$ 100% light on, and $m_{p2}$ plant growth mode at 1200 ppm, respectively. This mode specifies the raw model:

$$
\begin{aligned}
F_1(m_{r2}) &= \{x_{c1,(k)} = 0.5\,u_{c1,(k-1)},\ y_{c1} = x_{c1}\} \\
F_2(m_{r2}) &= \{x_{c2,(k)} = 0.5\,u_{c1,(k-1)},\ y_{c2} = x_{c2}\} \\
F_3(m_{v1}) &= \{w_{c1} = 0.0\} \\
F_4(m_{v1}) &= \{w_{c2} = 0.0\} \\
F_5(m_{l2}) &= \{w_{c3} = 1204.0\} \\
F_6(m_{p2}) &= \{x_{c3,(k)} = x_{c3,(k-1)} + 20.163 \cdot [-1.516 \cdot 10^{-4} f_1(w_{c3,(k-1)}) f_2(x_{c3,(k-1)}) \\
&\qquad + y_{c1,(k-1)} + y_{c2,(k-1)} + w_{c1,(k-1)} + w_{c2,(k-1)}],\ y_{c3} = x_{c3}\},
\end{aligned} \tag{9}
$$

where $f_1$ and $f_2$ denote

$$
\begin{aligned}
f_1(w_{c3}) &:= -7.615 + 0.111\,w_{c3} - 2.149 \cdot 10^{-5}\,w_{c3}^2 \\
f_2(x_{c3}) &:= 72.0 - 78.89\,e^{-x_{c3}/400.0}.
\end{aligned} \tag{10}
$$

$x_{c1,(k)}$ and $x_{c2,(k)}$ denote the gas flow ([g/min]) of flow regulator 1 and 2, respectively, and $x_{c3,(k)}$ denotes the CO2 gas concentration ([ppm]) in the plant growth chamber. $w_{c1,(k)}$ and $w_{c2,(k)}$ denote the gas flow ([g/min]) of the pulse injection valves and $w_{c3,(k)}$ denotes the photosynthetic photon flux ([$\mu$mol/m$^2$s]) of the lights above the plant trays. The nonlinear expression

$$-1.516 \cdot 10^{-4} f_1(w_{c3,(k-1)}) f_2(x_{c3,(k-1)})$$

approximates the CO2 gas production [g/min] due to photosynthesis according to the CO2 gas concentration and chamber illumination [12]. This raw model defines a third order system of discrete-time difference equations with sampling period $T_s = 1$ [min]:

$$
\begin{aligned}
x_{c1,(k)} &= 0.5\,u_{c1,(k-1)} + v_{s1,(k-1)} \\
x_{c2,(k)} &= 0.5\,u_{c1,(k-1)} + v_{s2,(k-1)} \\
x_{c3,(k)} &= x_{c3,(k-1)} + 20.163\,[-1.041 + 1.141\,e^{-x_{c3,(k-1)}/400.0} + x_{c1,(k-1)} + x_{c2,(k-1)}] + v_{s3,(k-1)} \\
y_{c1,(k)} &= x_{c1,(k)} + v_{o1,(k)} \\
y_{c2,(k)} &= x_{c2,(k)} + v_{o2,(k)} \\
y_{c3,(k)} &= x_{c3,(k)} + v_{o3,(k)}.
\end{aligned} \tag{11}
$$

The causal graph (Fig. 13) of the raw model (9) leads to the decomposition of the system as shown in Fig. 14 (our implementation of the causal analysis and decomposition algorithms treats constant values, such as the value 1204.0 for the photosynthetic photon flux, as known exogenous inputs with constant value). The decomposition of the model leads to a filter cluster with 3 extended Kalman filters: one for each flow regulator and one for the remaining system (pulse injection valves, lighting system and plant growth chamber). This enables us to estimate the mode and continuous state of the flow regulators independent of the remaining system. As a consequence, an unknown mode in a flow regulator has no implications for the estimation of the remaining system.

Figure 14. Partitioned causal SCC graph of the BIO-Plex cPHA model
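As a quick numerical check of the nominal-mode model (11), the noise-free difference equations can be iterated directly. The sketch below is only an illustration: the function name `nominal_step` is hypothetical, the disturbances are set to zero, and the control input is chosen, by assumption, such that the photosynthesis term is exactly balanced at 1200 ppm.

```python
import math


def nominal_step(state, u_c1):
    """Noise-free one-step prediction with the nominal-mode model (11)."""
    x_c1, x_c2, x_c3 = state
    photosynthesis = -1.041 + 1.141 * math.exp(-x_c3 / 400.0)
    return (0.5 * u_c1,
            0.5 * u_c1,
            x_c3 + 20.163 * (photosynthesis + x_c1 + x_c2))


# Control input that balances the photosynthesis term at 1200 ppm.
u_eq = 1.041 - 1.141 * math.exp(-1200.0 / 400.0)
state = (0.5 * u_eq, 0.5 * u_eq, 1200.0)
for _ in range(10):
    state = nominal_step(state, u_eq)
```

Under these assumptions the required total injection rate is about 0.98 g/min, half of which is delivered by each flow regulator, and the concentration stays at 1200 ppm.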


Fig. 15 shows the continuous input (control signal) $u_{c1}$, observed flow rates for flow regulator 1 and 2 and the CO2 concentration for the experiment. Both flow regulators provide half of the requested gas injection rate up to k = 700. At this time point, the second flow regulator starts to slowly drift towards its positive limit which it will reach at approximately k = 800. The chamber control system reacts immediately and lowers the control signal in order to keep the CO2 concentration at the requested 1200 ppm concentration. This transient behavior causes a slight bump in the CO2 concentration as shown in Fig. 15-b. Our hybrid mode estimation system detects this unmodeled fault at k = 727 and declares flow regulator 2 to be in an unknown mode (we indicate the unknown mode by the mode number 0 in Fig. 16). The flow regulator mode stuck-open ($m_{r5}$) becomes more and more likely as the regulator drifts towards its open position. Hybrid mode estimation prefers this mode as symptom explanation from k = 769 onwards, although flow regulator 2 goes into saturation a little bit later at k = 800.

Figure 13. Causal graph of the BIO-Plex cPHA raw model (9)

Figure 15. Observed data and continuous estimation of the CO2 concentration in plant growth chamber: (a) control input $u_c$ and measured CO2 input flow rates; (b) CO2 level in PGC (measurement - gray/green, estimate - black)

Figure 16. Mode estimate detail for flow regulator 2

The  light  fault  at  k  =  900  is  detected  almost  instantly  at  k  =  904 
(mu).  This  good  discrimination  among  the  pre-specified  modes 
(failure  and  nominal)  is  further  demonstrated  at  the  termination 
points  of  the  faults.  Repairs  of  the  flow  regulator  2  and  the  lighting 
system  are  detected  immediately  at  k  =  1101  and  k  =  1301,  re¬ 
spectively.  Fig.  17  shows  the  mode  estimation  result  for  the  lighting 
system  and  flow  regulator  2  over  the  entire  experiment  horizon. 


Figure 17. Mode estimates for flow regulator 2 and lighting system (time axis in minutes)

5  Implementation  and  Discussion 

The implementation of our hybrid estimation scheme extends previous work on hybrid estimation [9] and is written in Common LISP. The hybrid estimator uses a cPHA description and performs decomposition and estimation, as outlined above. Decomposition is done on-line according to the mode hypotheses that are tested in the course of hybrid estimation. In general, it can be assumed that the mode in the system evolves at a lower rate than the hybrid estimation rate, which operates on the sampling period $T_s$. Therefore, we cache recent decompositions and their corresponding filters for re-use as a compromise between a-priori calculation (space complexity) and pure on-line deduction (time complexity).
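The caching compromise can be illustrated with a bounded least-recently-used cache keyed by the mode hypothesis. The sketch is in Python rather than Common LISP, and `derive_filter_cluster` is a hypothetical placeholder for the actual decomposition and filter derivation; the cache size of 64 is an arbitrary assumption.

```python
from functools import lru_cache

build_count = 0  # counts how often the (expensive) derivation actually runs


def derive_filter_cluster(mode):
    """Hypothetical stand-in for decomposition plus filter derivation.

    Here it merely returns the indices of the components whose mode is known.
    """
    global build_count
    build_count += 1
    return tuple(i for i, m in enumerate(mode) if m != "?")


@lru_cache(maxsize=64)  # keep only the most recently used decompositions
def filter_cluster(mode):
    return derive_filter_cluster(mode)


nominal = ("mr2", "mr2", "mv1", "mv1", "ml2", "mp2")
unknown_fr2 = ("mr2", "?", "mv1", "mv1", "ml2", "mp2")
for mode in (nominal, nominal, unknown_fr2, nominal):
    filter_cluster(mode)
```

Four requests trigger only two derivations here; the bound on the cache size limits memory use while modes that recur within a short horizon are served without recomputation.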

Optimized  model-based  estimation  schemes,  such  as 
Livingstone  [22],  utilize  conflicts  to  focus  the  underlying  search 
operation.  A  conflict  is  a  (partial)  mode  assignment  that  makes  a 
hypothesis  very  unlikely.  This  requires  a  more  general  treatment 
of  unknown  modes  compared  to  the  filter  decomposition  task 
introduced  above.  The  decompositional  model-based  learning 
system Moriarty [21] introduced continuous variants of conflicts,
so-called dissents. We are currently reformulating these dissents for
hybrid systems and investigating their incorporation to improve the
underlying  search  scheme.  This  will  lead  to  an  overall  framework 
that  unifies  our  previous  work  on  Livingstone,  Moriarty  and  hybrid 
estimation. 
A Extended Kalman Filter

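The disturbances and imprecise knowledge about the initial state $\mathbf{x}_{c,(0)}$ make it necessary to estimate the state by its mean $\hat{\mathbf{x}}_{c,(k)}$ and covariance matrix $\mathbf{P}_{(k)}$. We use an extended Kalman filter [2] for this purpose, which updates its current state, like an HMM observer, in two steps. The first step uses the model to predict the mean for the state $\hat{\mathbf{x}}_{c,(\bullet k)}$ and its covariance $\mathbf{P}_{(\bullet k)}$, based on the previous estimate $\langle \hat{\mathbf{x}}_{c,(k-1)}, \mathbf{P}_{(k-1)} \rangle$ and the control input $\mathbf{u}_{c,(k-1)}$:

$$\hat{\mathbf{x}}_{c,(\bullet k)} = \mathbf{f}(\hat{\mathbf{x}}_{c,(k-1)}, \mathbf{u}_{c,(k-1)}) \tag{12}$$

$$\mathbf{A}_{(k-1)} = \left.\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right|_{\hat{\mathbf{x}}_{c,(k-1)}, \mathbf{u}_{c,(k-1)}} \tag{13}$$

$$\mathbf{P}_{(\bullet k)} = \mathbf{A}_{(k-1)} \mathbf{P}_{(k-1)} \mathbf{A}_{(k-1)}^T + \mathbf{Q}. \tag{14}$$

This one-step ahead prediction leads to a prediction residual $\mathbf{r}_{(k)}$ with covariance matrix $\mathbf{S}_{(k)}$:

$$\mathbf{r}_{(k)} = \mathbf{y}_{c,(k)} - \mathbf{g}(\hat{\mathbf{x}}_{c,(\bullet k)}, \mathbf{u}_{c,(k)}) \tag{15}$$

$$\mathbf{C}_{(k)} = \left.\frac{\partial \mathbf{g}}{\partial \mathbf{x}}\right|_{\hat{\mathbf{x}}_{c,(\bullet k)}, \mathbf{u}_{c,(k)}} \tag{16}$$

$$\mathbf{S}_{(k)} = \mathbf{C}_{(k)} \mathbf{P}_{(\bullet k)} \mathbf{C}_{(k)}^T + \mathbf{R}. \tag{17}$$

The second filter step calculates the Kalman filter gain $\mathbf{K}_{(k)}$, and refines the prediction as follows:

$$\mathbf{K}_{(k)} = \mathbf{P}_{(\bullet k)} \mathbf{C}_{(k)}^T \mathbf{S}_{(k)}^{-1} \tag{18}$$

$$\hat{\mathbf{x}}_{c,(k)} = \hat{\mathbf{x}}_{c,(\bullet k)} + \mathbf{K}_{(k)} \mathbf{r}_{(k)} \tag{19}$$

$$\mathbf{P}_{(k)} = [\mathbf{I} - \mathbf{K}_{(k)} \mathbf{C}_{(k)}]\, \mathbf{P}_{(\bullet k)}. \tag{20}$$

The output of the extended Kalman filter, as used in our hybrid estimation system, is a sequence of mean/covariance pairs $\langle \hat{\mathbf{x}}_{c,(k)}, \mathbf{P}_{(k)} \rangle$ for $\mathbf{x}_{c,(k)}$ as well as the hybrid probabilistic observation function

$$P_O(\mathbf{y}_{(k)} \mid \hat{\mathbf{x}}_{(k)}, \mathbf{u}_{c,(k)}) = e^{-\mathbf{r}_{(k)}^T \mathbf{S}_{(k)}^{-1} \mathbf{r}_{(k)} / 2}. \tag{21}$$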
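For a system with a single state variable and a single output, the filter equations (12) to (21) reduce to scalar arithmetic. The sketch below assumes such a scalar system; the function name `ekf_step`, the noise variances `q` and `r`, and the choice of the chamber equation of (11) with the summed flow rates as its input are assumptions made for illustration.

```python
import math


def ekf_step(x_prev, p_prev, u_prev, u_now, y, f, dfdx, g, dgdx, q, r):
    """One scalar extended Kalman filter step, cf. (12)-(21)."""
    x_pred = f(x_prev, u_prev)                   # (12)
    a = dfdx(x_prev, u_prev)                     # (13)
    p_pred = a * p_prev * a + q                  # (14)
    residual = y - g(x_pred, u_now)              # (15)
    c = dgdx(x_pred, u_now)                      # (16)
    s = c * p_pred * c + r                       # (17)
    gain = p_pred * c / s                        # (18)
    x_new = x_pred + gain * residual             # (19)
    p_new = (1.0 - gain * c) * p_pred            # (20)
    p_obs = math.exp(-residual * residual / (2.0 * s))   # (21)
    return x_new, p_new, p_obs


# Chamber dynamics of (11); u is the sum of both flow rates (virtual input).
def f(x, u):
    return x + 20.163 * (-1.041 + 1.141 * math.exp(-x / 400.0) + u)


def dfdx(x, u):
    return 1.0 - 20.163 * 1.141 / 400.0 * math.exp(-x / 400.0)


def g(x, u):
    return x


def dgdx(x, u):
    return 1.0


u_eq = 1.041 - 1.141 * math.exp(-3.0)   # flow that holds 1200 ppm
x_new, p_new, p_obs = ekf_step(1200.0, 25.0, u_eq, u_eq, 1200.0,
                               f, dfdx, g, dgdx, q=1.0, r=4.0)
_, _, p_obs_far = ekf_step(1200.0, 25.0, u_eq, u_eq, 1250.0,
                           f, dfdx, g, dgdx, q=1.0, r=4.0)
```

A measurement that agrees with the prediction yields an observation function value close to 1, whereas a measurement 50 ppm off drives it towards 0, which is what lets hybrid estimation discriminate between mode hypotheses.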

Abstract. In this paper we propose a component-based hybrid formalism that represents physical phenomena by combining concurrent automata with continuous uncertain dynamic models. The formalism eases the modeling of complex physical systems, and adds concurrency to the supervision of hybrid systems. Uncertainties in the model are integrated as probabilities at the discrete level and intervals at the continuous level. Our modeling framework is rather generic while focusing on the construction of intelligent autonomous supervisors by integrating a continuous/discrete interface able to reason on-line in any region of the physical system state-space, for behavior simulation, diagnosis and system tracking.

1  INTRODUCTION 

In the past few years, numerous works have been presented to model embedded systems with hybrid models and reason about them for simulation, diagnosis [9] or verification [1] purposes. The modeling framework usually expresses the different operating modes of the system as a set of finite automata and associates with each mode continuous knowledge encoded through standard numeric differential equations. In this paper we propose a component-based hybrid formalism that represents physical phenomena by combining concurrent automata with continuous uncertain dynamic models. However, it is not sufficient to add continuous knowledge to automata, because moving between operating modes requires the automatic construction of the structure of the newly assembled continuous model. This means computing both the characterization of the region of the state-space of the operating mode (denoted as a configuration) and a proper causal ordering between the active variables in that mode. No pre-study of the behavior of the physical system is required to determine the state-space regions associated with the current system configuration(s), because the search at the continuous level is cast as a boolean constraint satisfaction problem. A reasoning continuous/discrete interface (C/D I) is thus added, which provides an on-line generation of the characterization of the new model structure by making use of enhanced Truth Maintenance techniques [18] on the logical model. This is key to achieving the diagnosis of the hybrid system, for which detection is provided by the continuous layer and state identification is performed at the discrete logical level by searching for the current configuration consistent with observations. At the same time, the logical framework allows the description of purely discrete component behavior in the same manner as in [17]. Section 2 describes the discrete and the continuous layers; Section 3 presents the interface that integrates both layers together; Section 4 presents the algorithms required to reason about hybrid models and to track multiple trajectories in both simulation and diagnosis; Section 5 discusses our research and compares it with related work.

2  Hybrid  System  Formulation 

2.1  Hybrid  Systems  as  Transition  Systems 

The set of all components of the physical system to be modeled is denoted by Comps. Every component in that set is described by a hybrid transition system. The set of all variables used to describe a component is denoted $V$ and is partitioned in the following manner:

• $\Pi = \Pi_M \cup \Pi_C \cup \Pi_{Cond} \cup \Pi_D$: set of discrete variables of 4 distinct types (Mode, Command, Conditional, Dependent),

• $\Xi = \Xi_I \cup \Xi_D$: set of continuous variables of 2 distinct types (Input, Dependent).

Mode variables $\Pi_M$ represent components' nominal or faulty modes, such as on or stuck. Command variables $\Pi_C$ are endogenous and exogenous commands modeled as discrete events to the system (e.g. software commands). Continuous input variables $\Xi_I$ are exogenous continuous signals to the system determined by its environment (e.g. known inputs or disturbances). Conditional variables $\Pi_{Cond}$ are specific discrete variables that represent conditions on continuous variables. Discrete and continuous dependent variables are all other variables. Finally, the set Obs contains the observable variables of the physical system. Each observable signal has an explicit sampling period. Our hybrid transition system is an extension of the standard transition system [8] that adds (qualitative or quantitative) constraints to the states.

Definition 1 (Hybrid Transition System - HTS) A Hybrid Transition System HTS is a tuple $(V, \Sigma, T, C, \Theta)$ with:

• $V = \Pi \cup \Xi$: set of all variables. $\forall v \in V$, the domain of $v$ is $D[v]$, finite for variables in $\Pi$, intervals or real values in $\mathbb{R}$ otherwise.

• $\Sigma$: set of all interpretations over $V$.
Each state in $\Sigma$ assigns a value from its domain to any variable $v \in V$.

• $T$: finite set of transition variables.
Each variable $\tau_m$ in $T$ ranges over its domain $D[\tau_m]$ of possible transitions of the mode variable $m \in \Pi_M$. Each $\tau^i_m$ in $D[\tau_m]$ is a function $\tau^i_m : \Sigma \to 2^{\Sigma}$, associated with a mapping function $l_{\tau_i}$.

• $C$: set of (qualitative or quantitative) continuous constraints over $V$.
Each constraint $c$ in $C$ depends on at least one mode variable in $\Pi_M$. $\forall m \in \Pi_M$, we denote by $C[m]$ the set of constraints associated with the variable $m$.

• $\Theta$: set of initial conditions.
$\Theta$ is a set of assertions over $V$ that define the set of possible initial states, i.e. the set of states $s$ in $\Sigma$ such that $s \models \Theta$.

Note that in a HTS, due to the continuous constraints in $C$, some transitions can trigger according to conditions over continuous variables. At the discrete/continuous interface level, these conditions have a corresponding discrete variable in $\Pi_{Cond}$, which captures their truth value. Throughout this paper we illustrate the formalism, and later on the diagnosis operation, on a simple example. Figure 1 shows the HTS of a room R subject to a temperature source. It has two nominal modes, open (a door or a window is open) and closed, and a faulty unknown mode. The room temperature $x$ is influenced by the temperature of the source $x_e$ according to a first-order differential equation which accounts for the room characteristics $Q_c$ (closed) and $Q_o$ (open). The actions that move the room from one mode to another are modeled as observed single discrete commands cmd = open and cmd = close. Figure 2 presents the model of a thermostat T, with faulty modes stuck_on, stuck_off and unknown, as well as the required transitions. This thermostat switches according to the room temperature $x$ (it should be in its on mode when the temperature $x < m$ to warm up the room, and back to its off mode when $x > M$ to cool it down). $x$ is hence influenced by the heater setting temperature $h$ (in mode on) or by the outside temperature $x_{ext}$ (in mode off). The temperature variation $\dot{x}$ is observed through a sensor with additive noise $x_{noi}$. Initially, $x = x_{ext}$, the room is closed and the thermostat is on. Variables of both HTS are:

$R.mode \in \Pi_M = (closed, open, unknown)$
$R.cmd \in \Pi_C = (none, open, close)$
$R.c \in \Pi_{Cond} = (R.x < m,\; R.x > m \wedge R.x < M,\; R.x > M)$
$R.\dot{x} \in \Xi_D \in [-\infty, +\infty]$
$R.x \in \Xi_D \in [-\infty, +\infty]$
$R.\Delta x \in \Xi_D \in [-\infty, +\infty]$
$R.x_{noi} \in \Xi_I \in [-1, 1]$
$R.Q_c \in [0.05, 0.15]$
$R.Q_o \in [0.02, 0.05]$
$R.a \in [0.9, 1.1]$
$R.x_{ext} = 4$
$T.mode \in \Pi_M = (off, on, stuck\_on, stuck\_off, unknown)$
$T.M = 17$
$T.m = 10$
$T.h = 20$
$Obs = \{\dot{x}\}$

Figure 1. room HTS with unknown mode. Mode constraints: $C[open] : \dot{x} = a Q_o \Delta x$ and $C[closed] : \dot{x} = a Q_c \Delta x$, where $\Delta x = x_e - x$; mapping functions: $l_1 : \dot{x} = a Q_c \Delta x$, $l_2 : \dot{x} = a Q_o \Delta x$.

Figure 2. thermostat HTS with fault modes. Mode constraints: $C[off]$ ($C[stuck\_off]$) $: x_e = x_{ext}$ and $C[on]$ ($C[stuck\_on]$) $: x_e = h$; mapping functions: $l_1 : \dot{x} = a Q_{o/c}(h - T.m)$, $l_2 : \dot{x} = a Q_{o/c}(x_{ext} - T.M)$.

2.1.1  States  and  Time 

Considerations about time are central because the discrete and the continuous frameworks use different time representations. At the continuous level, time is explicit in the equations that represent the physical system behavior; we call it physical time $\theta$. Physical time is discretized according to the highest frequency sensor, providing the HTS reference sampling period $T_s$. $x(kT_s)$, or $x(k)$ for short, specifies the value of the continuous vector of state-variables in $\Xi$ at physical time $kT_s$. We call abstract time the time at the discrete level. It is dated according to the occurrence of discrete events. At date $t$, the discrete state $\pi_t$ of a HTS is the tuple $(M_t, Q_t)$, where $M_t$ is the vector of instances of mode variables, and $Q_t$ the vector of instances of variables of $\Pi$ in qualitative constraints. Discrete state-variables are in $\Pi \setminus \Pi_{Cond}$. Abstract time dates are indexed on physical time, which informs about how long a component has been in a given discrete state. If $t = kT_s$, then we write the indexed date $t^k$. When there is no ambiguity it is simply denoted by $t$. The hybrid state $s_{t^k}$ of a HTS is the tuple $(\pi_{t^k}, x(k))$.

2.1.2  Transitions 

Transitions describe changes between modes over time. The transition variable associated with a mode variable $m$ is denoted $\tau_m$, such that its domain is $D[\tau_m] = \{\tau^i_m \in T_N\} \cup \{\tau^j_m \in T_F\} \cup \{\tau^{id}_m\}$, with:

• $T_N$ the set of nominal transitions that express switches from one nominal mode to another,

• $T_F$ the set of faulty transitions that move the HTS into a faulty mode,

• $\tau^{id}_m$ the identity transition that allows a HTS to stay in its current mode.

Because transitions cannot always be considered as instantaneous with respect to the frequency of the sensors, we introduce delays on nominal transitions. Delay $d_{\tau_i}$ is such that once a transition is enabled it is triggered after $d_{\tau_i} T_s$, i.e. after $d_{\tau_i}$ physical time units. While a transition is enabled and waiting for its delay to expire, it is said to be in standby. For simplicity, the delay is referred to as $d$ when there is no ambiguity. A delay on a transition can also be modeled by adding modes and clocks to the hybrid transition system [4]. We do not use this representation here because we think that it hinders the easy representation of a component as a transition system, by creating modes that are irrelevant for diagnosis purposes. To model faults, we define fault modes whose behavior is known, such as stuck_on or stuck_off, and a unique mode unknown that is rather specific because it has no constraints and covers all interpretations in $\Sigma$. Modeled faults are often abrupt faults, in the sense that they do not represent tenuous parameter changes. Thus fault transitions have no delay, i.e. their duration is one physical time unit.

Definition 2 (pre and post assertions) For a given transition $\tau^i_m$ and a given state $s_{t^k} \in \Sigma$, we denote the assertions $pre(\tau^i_m) = m^j \wedge \phi_{\Pi_C \cup \Pi_{Cond}}$ and $post(\tau^i_m) = m^{j'}$ where:

• $m^j$ and $m^{j'}$ are two instances of the mode variable $m$,

• $\phi_{\Pi_C \cup \Pi_{Cond}}$ is a logical condition over instances of variables of both $\Pi_C$ and $\Pi_{Cond}$.

We refer to the guard of a transition as the condition statement $\phi_{\Pi_C \cup \Pi_{Cond}}$ that triggers the transition. Only fault transitions can be spontaneous, so their guard can be always true. Traditionally, probabilities are also attached to every nominal and faulty transition. In our example, $T$ is represented as follows ($\bigcirc$ is the next operator from temporal logic):

$R.\tau^1_{nom} : R.mode = closed \wedge R.cmd = open \;\bigcirc\; R.mode = open$
$R.\tau^2_{nom} : R.mode = open \wedge R.cmd = close \;\bigcirc\; R.mode = closed$
$R.\tau_{fail} : R.mode = open \vee R.mode = closed \;\bigcirc\; R.mode = unknown$

$T.\tau^1_{nom} : T.mode = off \wedge R.x < m \;\bigcirc\; T.mode = on$
$T.\tau^2_{nom} : T.mode = on \wedge R.x > M \;\bigcirc\; T.mode = off$
$T.\tau^1_{fail} : T.mode = on \wedge R.x > M \;\bigcirc\; T.mode = stuck\_off$
$T.\tau^2_{fail} : T.mode = off \wedge R.x < m \;\bigcirc\; T.mode = stuck\_on$
$T.\tau^3_{fail} : T.mode = on \;\bigcirc\; T.mode = stuck\_on$
$T.\tau^4_{fail} : T.mode = off \;\bigcirc\; T.mode = stuck\_off$
$T.\tau^5_{fail} : T.mode = on \vee T.mode = off \;\bigcirc\; T.mode = unknown$

There  is  no  delay  when  the  thermostat  (room)  switches  between  on 
(open)  and  off  ( closed )  modes. 
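The thermostat transitions listed above can be sketched as guarded rules in Python. This is only an illustration under stated assumptions: the class `Transition`, the helper names `enabled` and `fire`, the flat dictionary used as a state, and the string labels such as `"T.nom1"` are hypothetical and not part of the formalism; the thresholds are the values of T.m and T.M from the example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

M_LOW, M_HIGH = 10.0, 17.0   # T.m and T.M in the example


@dataclass(frozen=True)
class Transition:
    name: str                       # illustrative label
    var: str                        # mode variable the transition acts on
    pre_modes: Tuple[str, ...]      # source modes allowed by the pre assertion
    guard: Callable[[dict], bool]   # condition over command/conditional variables
    post_mode: str                  # mode asserted by the post assertion


def always(state):
    """Guard of a spontaneous fault transition: always true."""
    return True


THERMOSTAT = [
    Transition("T.nom1", "T.mode", ("off",), lambda s: s["R.x"] < M_LOW, "on"),
    Transition("T.nom2", "T.mode", ("on",), lambda s: s["R.x"] > M_HIGH, "off"),
    Transition("T.fail1", "T.mode", ("on",), lambda s: s["R.x"] > M_HIGH, "stuck_off"),
    Transition("T.fail2", "T.mode", ("off",), lambda s: s["R.x"] < M_LOW, "stuck_on"),
    Transition("T.fail3", "T.mode", ("on",), always, "stuck_on"),
    Transition("T.fail4", "T.mode", ("off",), always, "stuck_off"),
    Transition("T.fail5", "T.mode", ("on", "off"), always, "unknown"),
]


def enabled(state, transitions):
    """Transitions whose pre assertion (source mode and guard) holds in `state`."""
    return [t for t in transitions
            if state[t.var] in t.pre_modes and t.guard(state)]


def fire(state, transition):
    """Return a copy of `state` in which the post assertion holds."""
    new_state = dict(state)
    new_state[transition.var] = transition.post_mode
    return new_state
```

For a thermostat that is off while the room temperature is 8, `enabled` returns the nominal switch to on together with the three fault transitions whose pre assertions hold; a tracker would then branch over these candidates, typically weighting them by their attached probabilities.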

2.2  Moving  between  modes 

When a transition triggers, the component switches from one mode to another, and the corresponding HTS needs to transfer its continuous state vector $x$ as well. For that reason each transition is associated with a mapping function $l_{\tau_i} : \Sigma \to \Sigma$ over the dependent variables in $V$. It initializes the value of a subset of variables in the hybrid state resulting from applying $\tau^i_m$ to $s_{t^k_l}$, where $l$ is the abstract time index. Other variables in $s_{t^k_l}$ keep their previous value. The identity mapping function is denoted $l_{id}$. Triggering a transition is a two-step operation [1]. First, the mode change is performed by applying the transition $\tau^i_m$ to the current hybrid state and moving to the resulting mode after its delay has expired (transition relation $\xrightarrow{\tau^i_m}$):

$$\frac{\tau^i_m \in T, \quad (s_{t^k_l}, s_{t^{k+d}_{l+1}}) \in \Sigma^2, \quad s_{t^k_l} \models pre(\tau^i_m)}{s_{t^k_l} \xrightarrow{\tau^i_m} s_{t^{k+d}_{l+1}}} \qquad (1)$$

Second, initialization is performed by making use of the mapping function, and physical time goes on (time-step relation $\xrightarrow{\theta}$):

$$\frac{(\pi_{t_{l+1}}, x(k+d)) = l_{\tau_i}(s_{t^k_l})}{(\pi_{t_{l+1}}, x(k+d)) \xrightarrow{\theta} (\pi_{t_{l+1}}, x(\theta))} \qquad (2)$$

where $x(\theta)$ is the continuous state associated with the discrete state $\pi_{t_{l+1}}$ over the continuous time $\theta$. In the systems we are interested in, most of the discontinuities are driven by controller actions and preserve the continuity of the state variables. In our example, the temperature is obviously continuous when the thermostat switches from on to off, and we use the temperature T.M at this point to compute $\dot{x} = a Q_c (x_e - T.M)$. However, it has been shown in [10] that in specific cases, retrieving a mapping function from the models of both considered modes is far from trivial and requires a deep understanding of the physics of the phenomena abstracted in the discontinuity.
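The two-step triggering described in this section can be sketched as follows for the thermostat switching from on to off. This is a minimal sketch under stated assumptions: the function names `trigger` and `on_to_off_mapping`, the dictionary keys, and the point values chosen for $a$ and $Q_c$ (picked inside the intervals of the example) are illustrative only.

```python
A, Q_C = 1.0, 0.1          # point values inside R.a = [0.9, 1.1] and R.Qc = [0.05, 0.15]
X_EXT, T_M = 4.0, 17.0     # R.xext and T.M from the example


def trigger(state, k, post_mode, delay, mapping):
    """Fire a transition enabled at physical step k.

    Step 1: the mode change takes effect after `delay` sampling periods.
    Step 2: the mapping function re-initialises a subset of the continuous
            variables; variables it does not mention keep their previous value.
    """
    new_state = dict(state)
    new_state["mode"] = post_mode
    new_state.update(mapping(state))
    return new_state, k + delay


def on_to_off_mapping(state):
    """Mapping for on -> off: the derivative is recomputed at x = T.M with x_e = x_ext."""
    return {"dx": A * Q_C * (X_EXT - T_M)}


state = {"mode": "on", "x": 17.0, "dx": 0.3}
new_state, new_k = trigger(state, k=42, post_mode="off", delay=0,
                           mapping=on_to_off_mapping)
```

The temperature itself is carried over unchanged, which reflects the continuity of the state variable across the switch, while the derivative is re-initialised by the mapping.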

2.3  Component  modes  behavior 

We described how transitions express a component's dynamics between modes. At this point we want to represent each intra-mode behavior with two goals in mind: on the one hand the representation must encode the available qualitative or quantitative knowledge; on the other hand it must be suitable for efficient reasoning. For purely discrete components, usually software drivers as well as complex electronic devices, the behavioral model is given by a set of boolean constraints over $\Pi_C \cup \Pi_D$ that are associated with each mode variable value in the same manner as in [17]. For continuous components, the continuous behavior is expressed by discrete-time continuous constraints over $\Xi$. Each constraint is attached to a mode of the transition system. The discrete-time continuous constraints are of the following standard form:

$$\begin{cases} x(k+1) = A x(k) + \sum_{j=0}^{r} B_j u(k-j) \\ y(k+1) = C x(k+1) \end{cases} \qquad (3)$$

where $x(k)$, $y(k)$, and $u(k)$ represent the continuous state vector of dimension $n$, the output (observed) variables vector of dimension $p$ and the input (control) variables vector of dimension $q$ at time $kT_s$, respectively; $A$, $B_j$ and $C$ are matrices of appropriate dimensions. To provide a suitable framework for reasoning, continuous constraints are encoded in a specific two-level formalism [15] which includes a causal model and an analytical constraint level. The causal model is obtained from equation (3) by expressing it as a set of causal influences among the (state, input or output) variables. Influences may be of different types: dynamic, integral, static and constant. The following definition expresses first and second order dynamic influences:

Definition 3 (Dynamic influence) A dynamic influence $i_{ij}$ is a tuple $(\xi_i, \xi_j, K, T_d, T_r, cond)$ for first order differential relations and $(\xi_i, \xi_j, K, T_d, \zeta, \omega, cond)$ for second order relations with:

• $\xi_i \in \Xi$ and $\xi_j \in \Xi$ are two continuous variables such that $\xi_i$ influences $\xi_j$,

• $K$ is the parameter gain, representing the static gain of the influence,

• $T_d$ is the parameter delay, representing the time needed by $\xi_j$ to react to $\xi_i$,

• $T_r$ is the parameter response time, representing the time needed by $\xi_j$ to get to a new equilibrium state after having been perturbed,

• $\zeta$ is the damping ratio of the system,

• $\omega$ is the undamped natural frequency of the system,

• cond is the parameter condition, which specifies the logical condition under which the influence is active; cond ranges over elements of $V$.

The underlying operational model of dynamic influences is provided by the following equation:

$$\xi_j(k+1) = \sum_{p=0,\ldots,n-1} a_p\, \xi_j(k-p) + \sum_{q=0,\ldots,m} b_q\, \xi_i(k+1-q) \qquad (4)$$

where $\xi_j$ and $\xi_i$ are continuous variables, $n$ is the influence order and $m < n$ (causal link). Usually an equation is modeled by a set of influences. When necessary, uncertainties can be taken into account in the influence parameters and as additive disturbances. The former are represented by considering that parameters $a_p$ and $b_q$ have time-independent bounded values, i.e. they are given an interval value. The latter can be introduced as a bounded-value constant influence acting on $\xi_j$. From the superposition theorem that applies to the linear case, the computation of the updated value of variable $\xi_j \in \Xi$ in an equation eq consists in processing the sum of the activated influences from eq exerted on $\xi_j$ during the last time-interval. The prediction update of all the state and observed variables $x(k)$ and $y(k)$ from the knowledge of control variables $u(k)$ and influence activation conditions is performed along the causal model structure. Our representation of uncertainties leads to the prediction of continuous variable trajectories in the form of bounded envelopes. In other words, the system state $x(k)$ at every time instant $t = kT_s$ is provided in the form of a rectangle of dimension $n$.
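To make the notion of bounded envelopes concrete, the sketch below propagates a first-order instance of equation (4), $\xi_j(k+1) = a_0 \xi_j(k) + b_0 \xi_i(k+1)$, with interval-valued parameters and an optional bounded additive disturbance. It uses naive interval arithmetic, which is an assumption made here for illustration and not the propagation scheme of the paper; such arithmetic may overestimate the true envelope. All function names are hypothetical.

```python
def i_add(x, y):
    """Sum of two intervals given as (low, high) pairs."""
    return (x[0] + y[0], x[1] + y[1])


def i_mul(x, y):
    """Product of two intervals: min and max over the four end-point products."""
    products = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    return (min(products), max(products))


def first_order_step(xi_j, xi_i_next, a0, b0, disturbance=(0.0, 0.0)):
    """One step of xi_j(k+1) = a0 * xi_j(k) + b0 * xi_i(k+1) + disturbance."""
    return i_add(i_add(i_mul(a0, xi_j), i_mul(b0, xi_i_next)), disturbance)


def envelope(xi_j0, inputs, a0, b0, disturbance=(0.0, 0.0)):
    """Sequence of interval bounds on xi_j, one per time step."""
    trajectory = [xi_j0]
    for xi_i_next in inputs:
        trajectory.append(
            first_order_step(trajectory[-1], xi_i_next, a0, b0, disturbance))
    return trajectory
```

Starting from $\xi_j \in [1, 2]$ with $a_0 \in [0.4, 0.6]$ and a zero input, one step yields $[0.4, 1.2]$: the uncertainty on the parameter alone is enough to change the width of the envelope.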

Definition 4 (Causal system description - CD) The causal system description associated with the set of continuous constraints of a HTS is a directed graph $G = (\Xi, I)$ where $I$ is a set of edges supporting the influences among variables in $\Xi$, with their associated conditions and delays.

The numerical intervals obtained from equation (4) are refined at the analytical model level with global constraints by performing a tolerance propagation algorithm [6] on the set of variables. Back to the example, the feasible continuous states of $\Xi$ are specified by the influences in each HTS:

$R.i_1$ (static) : if (R.mode = closed) then $R.\Delta x \xrightarrow{Q_c} R.\dot{x}$

$R.i_2$ (static) : if (R.mode = open) then $R.\Delta x \xrightarrow{Q_o} R.\dot{x}$

$R.i_3$ (integral) : $R.\dot{x} \rightarrow R.x$

$R.i_4$ (static) : $R.x \rightarrow R.\Delta x$

$T.i_1$ (constant) : if (T.mode = on $\vee$ T.mode = stuck_on) then $T.h \rightarrow R.\Delta x$

$T.i_2$ (constant) : if (T.mode = off $\vee$ T.mode = stuck_off) then $R.x_{ext} \rightarrow R.\Delta x$

$T.i_3$ (constant) : $T.x_{noi} \rightarrow R.x$

Influences  without  explicit  conditions  are  valid  in  all  modes  except 
in  the  unknown  mode.  Figure  3  presents  the  nominal  CD  for  the 
room  and  the  thermostat. 

2.4  Hybrid  Component  System 

Once components have been modeled as HTS, constituting a generic reusable database of models, they need to be assembled in a Hybrid Component System to model the entire physical plant. Components are hence instantiated. Within the whole plant model, components are concurrent, i.e. able to evolve independently, which allows us to reason on subparts of the model.


Figure  3.  Causal  nominal  system  description  of  the  thermostat  and  room 
example 

Definition 5 (Hybrid Component System - HCS) A Hybrid Component System HCS is a tuple $(Comps, V, \Sigma, T, C, \Theta)$ with Comps being a set of $n$ components modeled as concurrent hybrid transition systems $H_i = (V_i, \Sigma_i, T_i, C_i, \Theta_i)$, $V = \bigcup_{i=1}^{n} V_i$, $\Sigma \subseteq \bigotimes_i \Sigma_i$, $T = \bigcup_i T_i$, $C = \bigcup_i C_i$, $\Theta = \bigcup_i \Theta_i$.

We track the evolution of a HCS over a temporal window in the form of a trajectory, i.e. a succession of states. At each time-step, constraints and commands first synchronize on shared variables in $\Pi_D$, $\Pi_C$ and $\Xi$ (the room and the thermostat share $\Delta x$). Shared variables serve as time-dated communication channels between automata. The automata must nevertheless synchronize between states. The synchronization uses transitions and is such that, given the components of the HCS:

• HTS that received a command synchronize on the corresponding nominal transition,

• non-commanded HTS synchronize on the identity transition $\tau^{id}$.

When synchronized, HTS instances are introduced into the trajectory, whereas other HTS are not copied at each time-step. Intuitively, we only want to introduce the minimal subset of the HTS necessary for tracking and diagnosis purposes. In [11], for discrete-only models, this subset is computed using a pre-compilation of prime implicants of mode variables. In our implementation, transitions synchronize a posteriori, and only when needed by the reasoner to operate. This saves a large amount of memory since, when tracking a physical system in its nominal long-term state, very few components need to be reintroduced.

The concurrency process is made more complex by the introduction of delays on transitions. Figure 4 presents an example of the synchronization of four concurrent HTS, $H_1$ to $H_4$. Four transitions are enabled on shared variables at time-step $t^k_l$ and synchronize over the three next time-steps with different delays, except for $d_{\tau_2}$ and $d_{\tau_3}$ that are equal. $H_1$ and $H_2$, as well as $H_3$ and $H_4$, have constraints that share variables. Due to different commands, the concurrency makes the four HTS change mode at time $t^k_l$, whereas other HTS in the model stay inactive (they are not represented in the figure). Then the synchronization effort takes into account the delays of triggered transitions as well as the links between HTS through shared variables:

• $H_2$ and $H_3$ have the same delay and thus participate in the same hybrid state at time-step $t^{k+d_{\tau_2}}_{l+1}$,

• $H_1$ and $H_2$ synchronize at $t^{k+d_{\tau_1}}_{l+2}$. This is done with the identity transition on $H_2$.

• $H_1$ (or $H_2$) and $H_3$ don't synchronize at $t^{k+d_{\tau_1}}_{l+2}$ because they don't share any variables,

• $H_1$ and $H_2$ share variables but don't synchronize at $t^{k+d_{\tau_2}}_{l+1}$ because $\tau_1$ is already in standby.


The  last  remark  is  of  importance  because  it  relies  on  the  hypothesis 
that  we  cannot  track  or  diagnose  a  physical  component  while  it  is 
switching  from  one  mode  to  another,  i.e.  when  one  of  its  transitions 
is  in  standby,  as  the  required  transient  models  are  often  unknown  or 
too  complex.  The  consequence  is  that  components  only  synchronize 
in  their  non-standby  states. 
Figure 4. synchronization over 3 states of four HTS.


3  Continuous/Discrete  Interface 

3.1  Configurations 

Depending on the mode at a given time, the hybrid state of a HCS ranges over several continuous regions. These regions are known to be difficult to determine and compute, if not undecidable. We propose an on-line mechanism to keep track of the state-space partition by sheltering every continuous functional piece with a conjunction of logical conditions we denote as a configuration.

Definition 6 (HCS configuration) A configuration for a HCS at time-step $t^k$ is a logical conjunction $\delta_{t^k} = (\bigwedge_i m^i) \wedge (\bigwedge_j \Pi^j_{Cond})$ where the $m^i$ are instantiations of component modes in $\Pi_M$ and the $\Pi^j_{Cond}$ are variables of $\Pi_{Cond}$.

The configurations are automatically drawn from conditions on both transition guards and influences that define structural changes in the model. A configuration can be attached to one or more modes in $\Pi_M$. In our example, the continuous state is easily partitioned by the thermostat's transitions into three regions determined by the three conditions on variable $x$, defining 27 configurations, for example:

$C_1 : R.mode = closed \wedge T.mode = on \wedge R.x < m$

$C_2 : R.mode = closed \wedge T.mode = on \wedge (R.x > m \wedge R.x < M)$

$C_3 : R.mode = closed \wedge T.mode = off \wedge (R.x > m \wedge R.x < M)$

$C_4 : R.mode = closed \wedge T.mode = off \wedge R.x > M$


Whatever the complexity of the conditions defining the regions of the physical system, it is easy to logically express any condition as a boolean variable of $\Pi_{Cond}$, whose values 1/0 correspond to the condition and its negation. This however leads to a number of partitions that is not optimal relative to the exact number of state-space regions in which the physical system evolves. Note that the configuration associated with the unknown mode encompasses the overall state-space.
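A configuration of the room and thermostat example can be sketched as a conjunction of mode literals and one conditional variable on the temperature. The sketch below is illustrative only: the names `CONDITIONS`, `configuration` and `holds` are hypothetical, and treating the middle region as closed at both thresholds is an assumption made so that the three conditions cover every temperature.

```python
M_LOW, M_HIGH = 10.0, 17.0   # T.m and T.M in the example

# Conditional variables on R.x; each maps a temperature to a truth value.
CONDITIONS = {
    "R.x < m": lambda x: x < M_LOW,
    "m <= R.x <= M": lambda x: M_LOW <= x <= M_HIGH,   # closed bounds: assumption
    "R.x > M": lambda x: x > M_HIGH,
}


def configuration(room_mode, thermo_mode, x):
    """Conjunction of mode literals and the condition on R.x that holds at x."""
    condition = next(name for name, test in CONDITIONS.items() if test(x))
    return (f"R.mode = {room_mode}", f"T.mode = {thermo_mode}", condition)


def holds(config, room_mode, thermo_mode, x):
    """True when the hybrid state (modes, x) lies in the region sheltered by `config`."""
    return config == configuration(room_mode, thermo_mode, x)
```

With the room closed, the thermostat on and a temperature of 12, the function returns the literals of the configuration written $C_2$ above; the same modes with a temperature of 18 fall outside it.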

3.2  Causal  ordering  for  static  equations 

When switching from one mode to another, some equations and variables are added or retracted according to the new configuration. Consequently, due to the possible presence of static continuous equations in the model, a proper causal ordering of variables is to be found when entering the new mode. A brute force approach would consist in generating a new causal structure for every different mode. The problem of performing an on-line incremental generation of the causal structure has been previously addressed [16] but it is solved here in a slightly different manner. This is done by first casting the problem as a boolean constraint satisfaction problem: every continuous equation and variable in the HCS is associated with boolean variables in $\Pi$ whose truth values state whether the variables or equations are active or not. Rules over the boolean variables are automatically built to represent the conditions of these activations and form a logical representation of the causal-ordering problem.


3.3  Overview 

The previous configuration and causal ordering problems are solved on-line by using a truth maintenance system (TMS) to reason on the corresponding boolean constraint satisfaction problems. We use the context switching algorithms of [18] because we are not interested in generating all configurations of the physical system but in switching from one to another as fast as possible. The HCS reacts to events, i.e. observations from sensors as well as commands, and propagates them to the model's discrete and continuous levels through the logical interface and back. Figure 5 sums up these interactions. The C/D I, made of the variables in $\Pi_{Cond}$ associated with influence conditions and transition guards, as well as the causal ordering logical model, ensures the logical consistency of the changes triggered by the flow of events.

Figure 5. 3-layers interactions

4  Simulation  and  Diagnosis  of  a  Hybrid 
Component  System 

4.1  Simulation 

A HCS simulation is a run of concurrent hybrid transition systems that generates possible nominal trajectories of the HCS according to issued commands and inputs over time. The uncertainty on the continuous constraint parameters determines the precision of the computed envelopes that enclose the observed behavior of the physical system at each time step.

Sometimes the truth value of a condition in a configuration may be undetermined when checked against a rectangular enclosure of the continuous state-variables. The problem arises from the fact that some variables on which configurations rely are not measured. When the computed bounds of such a continuous variable $\xi_i$ span more than one configuration region relying on that variable, we say that the current configuration is splitting the continuous state on variable $\xi_i$. Figure 6 shows a configuration split for the thermostat example when crossing $x = M$. The current configuration splits on regions $x^1$ and $x^2$ and the two possible trajectories are tracked simultaneously. In applications, this situation happens rather frequently, and multiple consecutive splits of a guard on the same variable can occur because sensor frequencies are usually beneath the temporal uncertainty induced by the envelopes. We first want to split the continuous state into logical branches, then refine accordingly the bounds on all continuous variables in every explored branch.

Figure 6. Transition guard split

For a given continuous variable $\xi_i$, the logical split of a configuration $\delta_{t^k}$ returns the set of possible configurations to be tracked:


where the $\Pi^j_{Cond}(\xi_i)$ are variables of $\Pi_{Cond}$ relying on $\xi_i$ and the remaining $\Pi_{Cond}$ are the other conditions in $\delta_{t^k}$. Relation (5) is used to compute the split areas because it is much faster than exploring the overall continuous state space. The following algorithm is applied to every tracked trajectory:

1. The configuration $\delta_{t^k}$ is checked against the rectangular region defined by the variables' predicted envelopes to find a variable $\xi_i$ over which it is splitting the state-space,

2. The state-space is logically split with relation (5). For each configuration $\delta^j_{t^k}$ in $[\delta_{t^k}](\xi_i)$, its corresponding continuous region is denoted $x^j_{\xi_i}(k)$ and its corresponding discrete state $\pi^j_{t^k}$.

3. Envelopes over variables in $\Xi$ are refined in every region $x^j_{\xi_i}(k)$ by filtering them on the constraints defined by the conditions in the configuration [6],

4. $(\pi^j_{t^k}, x^j_{\xi_i}(k))$ constitute new hybrid states enclosed in new trajectories to be tracked.

The three preceding steps are applied for remaining variables on the growing set of generated trajectories. Finally the resulting set of computed hybrid states is:

(6) 

In our example, the thermostat's configurations only split on the
temperature x. In Figure 6, until the time-step at which x = M is
crossed, the configuration of the HCS is

C2: $R.mode = closed \land T.mode = on \land R.x > m \land R.x < M$

At that time-step, due to the crossing of x = M, the current configuration
is split on x. A new partial hybrid state comes from equation
(5):

$R.mode = closed \land T.mode = on \land R.x > M$

Then the bounds of variable x are refined in each configuration by
filtering the values with the respective constraints $R.x > m \land R.x < M$
and $R.x > M$. As transition $T.\tau_{nom}$ becomes enabled with the second
configuration, the configuration is instantaneously ($T.\tau_{nom}$ has no
delay) updated to:

Ca: $R.mode = closed \land T.mode = off \land R.x > M$  (7)

From that point the system tracks two distinct trajectories.
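The splitting step of the example can be illustrated with a minimal sketch. It assumes an envelope is a plain closed interval and that guards are simple thresholds on one variable; the function name `split_envelope` and the numeric values of m and M are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # closed interval [lo, hi] standing in for an envelope


def split_envelope(env: Interval, thresholds: List[float]) -> List[Interval]:
    """Split an envelope at every guard threshold lying strictly inside it.

    Each returned piece is the envelope refined (filtered) to one guard
    region, so each piece corresponds to one trajectory to be tracked.
    """
    lo, hi = env
    cuts = sorted(t for t in thresholds if lo < t < hi)
    bounds = [lo] + cuts + [hi]
    return list(zip(bounds[:-1], bounds[1:]))


# Hypothetical thermostat thresholds m = 18 and M = 22 (assumed values).
# The predicted envelope on x straddles M, so the configuration splits in two.
regions = split_envelope((21.5, 22.4), [18.0, 22.0])
print(regions)
```

When the envelope does not straddle any threshold, the function returns the envelope unchanged, which corresponds to a configuration that does not split.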

4.2  Fault  Detection 

The detection algorithm then uses the above prediction of the
endogenous continuous variable values to obtain robust decisions about
the existence of faults, based on adaptive thresholds provided by the
envelopes' upper and lower bounds. This is performed by comparing
the predicted and observed values of variables across time. The
adaptive thresholds principle reduces the possibility of false alarms
when tracking the system. However, to achieve better robustness, we
usually mark a variable as misbehaving only after it has been outside of
its bounds for at least $n_{misb}$ physical time-steps. After that delay, the
diagnosis operation is triggered.
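As an illustration of the thresholding rule, the following sketch flags a variable once its measure has stayed outside the predicted envelope for `n_misb` time-steps. Counting consecutive time-steps is an assumption made here, and the function name `detect` and the sample data are hypothetical.

```python
def detect(observations, envelopes, n_misb):
    """Return the first time-step at which the variable has been outside its
    envelope for n_misb consecutive time-steps, or None if that never happens.

    observations: measured values, one per time-step
    envelopes:    (lower, upper) predicted bounds, one pair per time-step
    """
    outside = 0
    for t, (y, (lo, hi)) in enumerate(zip(observations, envelopes)):
        if lo <= y <= hi:
            outside = 0          # back within bounds: reset the counter
        else:
            outside += 1
            if outside >= n_misb:
                return t         # diagnosis would be triggered here
    return None


obs = [1.0, 2.5, 1.0, 2.5, 2.6]
env = [(0.0, 2.0)] * 5
print(detect(obs, env, 1), detect(obs, env, 2))
```

With `n_misb = 1` the isolated excursion at time-step 1 already raises an alarm, whereas `n_misb = 2` waits for two excursions in a row.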

For dynamic influences, the algorithm sensitivity relies on a mixed
strategy which combines an observer-type strategy (closed-loop
mode, i.e. the measure of a variable y at time t is used to elaborate
the prediction of y at time t + 1) with a pure simulation strategy
(open-loop mode, i.e. the prediction of y at time t + 1 is obtained from
the prediction of y at time t) to determine the thresholds and further
assess variable states. We call this strategy a semi-closed loop (SCL)
strategy [13]. The mode control (open-loop or closed-loop) depends
on whether the observed value of a variable y is in the predicted
envelope (normal situation) or out of it (alarming situation). As soon
as the variable becomes alarming, running in closed-loop mode
might drive the prediction to follow the fault, making the detection
procedure insensitive to the fault. The prediction temporal window is
hence scaled up by switching to the open-loop mode. Note that the
fault detection mechanism is very efficient at ruling out wrong
trajectories issuing from multiple successive splits on the same boundary
constraint.
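The mode switching of the semi-closed loop strategy can be sketched as follows. The interval model used here, y(t+1) = a·y(t) + b with b known only within [b_lo, b_hi] and a ≥ 0, is a stand-in chosen for illustration, not the paper's model; the names `step` and `scl_track` are hypothetical.

```python
def step(env, a, b_lo, b_hi):
    """One-step interval prediction for y(t+1) = a*y(t) + b, b in [b_lo, b_hi].
    Valid for a >= 0 (assumed), so bounds map to bounds."""
    lo, hi = env
    return (a * lo + b_lo, a * hi + b_hi)


def scl_track(measures, a, b_lo, b_hi):
    """Return one (envelope, in_bounds) pair per time-step after the first.

    Closed-loop: while the measure lies in the envelope, the next prediction
    starts from the measure itself.
    Open-loop: once the measure is alarming, the next prediction starts from
    the previous predicted envelope, so the prediction cannot follow the fault.
    """
    source = (measures[0], measures[0])
    out = []
    for y in measures[1:]:
        env = step(source, a, b_lo, b_hi)
        in_bounds = env[0] <= y <= env[1]
        out.append((env, in_bounds))
        source = (y, y) if in_bounds else env
    return out


trace = scl_track([0.0, 1.0, 2.0, 5.0, 6.0], a=1.0, b_lo=0.75, b_hi=1.25)
for env, ok in trace:
    print(env, "closed-loop" if ok else "open-loop")
```

In this sketch the envelope width grows once the tracker runs open-loop, which is the widening effect that the blind state-tracking method of subsection 4.3.2 relies on.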

Figure 7 shows three scenarios with faults where detection is
applied. In the first scenario the thermostat fails to switch at time-step
63 and sticks to its on mode. In the second scenario the constant T.h
is degraded from time-step 46 to a lower value, so the heater is slower
to warm the room. Scenario three presents a fault characterized by an
abrupt structural change in the thermostat model. For all scenarios,
$n_{misb} = 1$.

4.3  Diagnosis 

When a fault is detected, the diagnosis procedure is invoked to find the
current configuration of the HCS according to observations, inputs and
commands. This must be performed over a finite temporal window
[11], but because of the fault detection at the continuous level the
problem of losing solutions is strongly reduced. The temporal window is
usually set to the physical time that corresponds to the longest
chain of non-repeated transitions. In our example 20 physical
time-steps cover a complete on-off sequence.


(a) Scenario 1, x: After detection and diagnosis, a few more time-steps
are necessary for the prediction to catch up with the physical
system. This comes from the fact that the estimation of the time
of the fault is not accurate enough: because of the time uncertainty
due to the envelopes, the estimation is a few time-steps late.

(b) Scenario 1, x: The fault is detected at time-step 68.

(c) Scenario 2, x: After the fault is diagnosed, the blind state-tracking
method uses the nominal behavior of the thermostat and predicts all
possible switches at each time-step: the very wide envelope shows that
it is not sure if the thermostat is on or off.

(d) Scenario 2, x: The fault is not so abrupt as to be detected
instantaneously. Measures go in the predicted bounds again at time-step 69.
This is due to the fact that when using the blind state-tracking method,
the thermostat's controller model is still switching on valid thresholds.

(e) Scenario 3, x: The thermostat switches on valid thresholds and
the blind state-tracking method keeps a relatively good tracking of
the temperature after the fault occurred. This is due to the fact that
the physical model of the room is still valid.

(f) Scenario 3, x: After a thermostat's structure change, the heater
setting temperature T.h is oscillating. When turned off, T keeps
its nominal behavior.


Figure  7.  Three  fault  scenarios 


Definition 7 (HCS Diagnosis) A diagnosis diag(t) over m time-steps
for an HCS is such that diag(t) =  with the consistency of:


Solving relation (8) is a three-step operation. First, existing conflicts
(sets of influences which cannot all be non-faulty) are exhibited
from the causal system description (CD) of the HCS, each influence
stamped with a temporal label and activation condition. They
are then turned into diagnosis candidates by a failure-time oriented
enhanced version of the hitting set algorithm [14]. Temporal
information is drawn from maximizing, on each component, the delays of
the influences downstream of the faulty variables in CD.
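To make the conflict-to-candidate step concrete, here is a brute-force enumeration of minimal hitting sets. It is the plain, untimed notion only: the failure-time oriented enhancements of [14] are not reproduced, and the conflict contents in the example are hypothetical.

```python
from itertools import combinations


def minimal_hitting_sets(conflicts):
    """Enumerate all minimal hitting sets of a list of conflict sets.

    Candidates are tried by increasing size, so any candidate containing an
    already found hitting set is not minimal and is skipped.
    """
    universe = sorted(set().union(*conflicts))
    found = []
    for size in range(1, len(universe) + 1):
        for cand in combinations(universe, size):
            s = set(cand)
            if any(h <= s for h in found):
                continue                      # superset of a smaller solution
            if all(s & c for c in conflicts): # intersects every conflict
                found.append(s)
    return found


# Hypothetical conflicts over influence names (illustration only).
conflicts = [{"T.i2", "R.i1"}, {"R.i1", "R.i3"}]
candidates = minimal_hitting_sets(conflicts)
print(candidates)
```

The enumeration is exponential in the number of influences, so it only serves to show what a diagnosis candidate is: a smallest set of influences whose failure accounts for every conflict.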

Second, at the configuration level, the TMS negates the activation
conditions of the conflicting influences and quickly iterates through the
remaining logical configurations to restore consistency. Finally,
every configuration found is checked against the past observations
over the temporal window before being approved, as in [11], except
that candidate generation and consistency checks are interleaved and
run from present time back to the beginning of the temporal window.
Configuration solutions to the diagnosis problem contain a mode
instantiation of every necessary component in the HTS explaining the
observations. Note that in Figure 7, for all three scenarios, the
diagnosis operation is performed in less than 0.1 seconds on a 300 MHz
Pentium II, which is shorter than the measurement sampling period, so
the detection time-step is equal to the diagnosis time-step.

4.3.1  Diagnosis  example  with  a  fault  mode 

When applied to the first scenario, the diagnosis starts as soon
as x goes out of its bounds for all currently tracked trajectories:
iterating through the system nominal CD from Figure 3, at
time-step 68 the influences in conflict are $F = \{T.i_3, T.i_2, R.i_1,
R.i_3, R.i_4\}$. Relative to the current configuration (7), this is
equivalent to adding the constraints
$F_c = \{\bigvee_{m_i \in D[T.mode]} T.mode = m_i,\ R.mode = closed,\ T.mode = off \lor T.mode = stuck\text{-}off,\ \bigvee_{m_j \in D[R.mode]} R.mode = m_j\}$,
which are activation conditions on the influences in conflict.
As $R.i_4$ has a delay of 1, the
elements of the last conflict are stamped with the current physical
time minus 1. Other conflict elements are stamped with the current
physical time.

The TMS then seeks consistency on both the configurations
and the transition model, starting from the current configuration, by
inserting the negation of the elements in $F_c$:
$\neg F_c = \{T.mode = unknown,\ R.mode = open \lor R.mode = unknown,\ T.mode = on \lor T.mode = stuck\text{-}on \lor T.mode = unknown,\ R.mode = unknown\}$,
and returns the following possible configurations ranked
according to the probabilities attached to transitions and to the
number of faults leading to them:

1: $(R.mode = closed) \land (T.mode = stuck\text{-}on) \land (R.x > M)$

2a: $(R.mode = closed) \land (T.mode = unknown) \land (R.x > M)$

2b: $(R.mode = unknown) \land (T.mode = stuck\text{-}on) \land (R.x > M)$

3: $(R.mode = unknown) \land (T.mode = unknown) \land (R.x > M)$

Other configurations with the thermostat in modes on, stuck-off, or
the room in mode open are ruled out during the search process
because there are no transitions or past observations and commands
consistent with these configurations. Diagnosis 1 fits with the fault
in the first scenario (the thermostat took transition $\tau_{fail}$). The state
vector is reinitialized according to the mapping function of $\tau_{fail}$
before the tracking continues.


4.3.2  Diagnosis  example  with  the  unknown  mode 

Scenarios 2 and 3 primarily lead to diagnosis 2a where the
thermostat is in the unknown mode. This mode is useful at the discrete level
because it ensures that there is always a solution to the diagnosis
problem⁵. At the continuous level however, it has no model, so it is
not possible to track an HTS in that mode. Isolating the unknown
automaton so as to continue the prediction of the behavior of the other
HTS in the model often leads to tracking based on a wrong model:
in scenario 2, once the mode of T has been diagnosed to be unknown,
influences referring to T are inactive, which is equivalent to predicting
R's behavior with T.h = 0. Our current solution to that problem is to
use a dedicated blind state-tracking method that is applicable thanks
to the semi-closed loop fault detection strategy described in
subsection 4.2. When a component is found to be in its unknown mode, the
nominal model of the component is used instead. The detection
module runs in open-loop prediction mode until the measures fall into
the envelopes again. This is guaranteed to occur because the
open-loop predicted envelopes widen with time (uncertainty propagation
of interval models). Triggered by this event, the detection module
then switches to closed-loop prediction mode and is able to track the
system until the measures get out of their bounds again, and so on.
This is the method applied to scenarios 2 and 3 in Figure 7. However,
in scenario 2, an improved solution could be to use parameter
estimation techniques as proposed in [9], because the structure of the
model is still valid. The drawbacks are the additional computational
cost and the fact that this would leave the system untracked for a period
of time (proper parameter estimation requires waiting for properly
excited data). More research is needed to integrate existing
parameter estimation and model fitting techniques into our framework. Also
note that such faults generally result from the natural degradation of
the monitored physical system and could be taken into account in
causal models [12].

5  Summary,  Discussion  and  Related  Work 

In  this  paper  we  extend  previous  work  on  diagnosis  in  the  AI  com¬ 
munity  by  presenting  a  formalism  that  merges  concurrent  automata 
with  continuous  dynamic  system  models  and  reasons  about  its  con¬ 
figurations  using  logical  tools.  The  problem  of  reasoning  about  and 
diagnosing  complex  physical  plants  without  computing  their  contin¬ 
uous  reachable  state-space  is  addressed.  The  approach  integrates  nu¬ 
merous  techniques  from  different  fields  into  a  runnable  standalone 
application,  which  is  able  to  deal  with  real-world  problems  such  as 
satellite  state-tracking  [3].  The  modeling,  simulation  and  diagnosis 
tools  are  implemented,  including  the  engine  that  splits  the  configu¬ 
rations.  The  program  generates  a  C++  runtime  that  is  intended  to  be 
demonstrated  on  an  autonomous  spacecraft  test  bench  at  CNES. 

Other formalisms for building comprehensive and tractable
hybrid systems include [10] and [4]. But none of these approaches
provides an intuitive component-based framework allowing engineers to
build reusable models of equipment. Moreover the models often
include numerous functional modes that are irrelevant to the diagnosis
task.  For  instance  [4]  introduces  additional  modes  to  deal  with  de¬ 
layed  transitions,  and  [10]  rather  focuses  on  the  expression  of  the 
approximations  able  to  produce  sound  hybrid  models  of  complex 
physical  systems.  Besides,  it  examines  types  of  discontinuities  that 
are rarely encountered in controlled systems. In such systems, most
of the discontinuities are driven by controller actions and preserve
state variables continuity.

5 Note that the unknown mode is also a dead-end since no nominal transition
can lead out of this mode.

Our  work  takes  numerous  ideas  from  the  discrete-only  work  at  the 
basis  of  Livingstone  [17,  11]  and  adds  and  links  continuous  knowl¬ 
edge  to  it.  The  difficult  problem  of  the  temporal  window  that  required 
aggregating  in  a  history  all  past  states  in  every  tracked  trajectory 
is  now  strongly  reduced  as  it  is  less  likely  that  a  wrong  trajectory 
is tracked without detecting anomalies at the continuous level. [9]
introduced a diagnosis-dedicated hybrid formalism relying on error
bounds for the detection parts, but without concurrency or
transitions triggered autonomously from the continuous level; it uses
probabilities, parameter estimation as well as data fitting to refine the
diagnosis. [20] unifies traditional continuous state observers with
hidden Markov model belief update in order to track hybrid systems
with noise but does not include concurrent models or any mapping
function discussion. The approach is interesting because it makes
extensive  use  of  probabilities  where  we  chose  to  rely  on  bounded 
uncertainties  (intervals)  at  the  continuous  level  and  on  probabilities 
at  the  discrete  level.  In  fact  these  are  different  uncertainties  as  the 
uncertainty  is  uniformly  distributed  in  the  case  of  intervals  whereas 
[20] relies on normal laws. In our view, using probabilities
at the discrete level allows us to prune an otherwise prohibitive search,
but intervals offer a more compact representation of uncertainties on
continuous variables. However, the point would need more
discussion and research. Similar approaches also include [21], which
combines a Petri net and signal analysis to estimate the discrete modes
and overcome an exponential cost in the number of sensors, but lacks
an efficient diagnosis engine; and [7], which uses a dedicated Bayesian
network as well as a smoothing method that helps successfully
diagnose faults with a very low belief state. Note that the model
checking community has recently investigated the use of interval-based
numerical models [5].

An advantage of our approach is that any type of condition
associated with transitions and influences (e.g. continuous functions as
guards) can be modeled and tracked without being directly observed.
Finally, on-line performance can be enhanced, as the formalism
allows the logical model to be pre-compiled before use by generating
prime implicants on transition guards [19] and influence conditions.
However  it  still  happens  that  trajectories  cannot  be  discriminated  due 
to  too  much  imprecision  on  parameters  that  leads  to  overlapping  en¬ 
velopes.  A  solution  to  this  problem  has  been  to  merge  such  envelopes 
and  corresponding  trajectories.  Another  remark  concerns  the  splits 
that  occur  and  are  not  linked  to  any  real  mode  or  structure  changes 
in  the  model:  when  starting  the  thermostat  and  room  models  with 
external  temperature  xext  <  m,  a  split  occurs  when  first  crossing 
at  x  =  m.  These  splits  however  are  sound  and  refine  the  bounds 
on  continuous  variables  as  they  allow  the  system  to  reduce  temporal 
uncertainty  at  the  crossing  point. 

Further work will focus on reconfiguration by reasoning on
configurations with the same core algorithms as for diagnosis. This will
be done by identifying a set of goal configurations and finding, under
uncertainty, a valid plan made of the least costly endogenous commands
to reach each goal. We think that additional improvements could also
include automatic controller synthesis as in [2] as well as parameter
estimation based on the causal structure of the continuous level in
order to refine the tracking of the system when in its unknown mode.
In the near future more results are expected, as our implementation is
intended to be tested on spacecraft models and run on-board
ground-based satellite hardware.
Abstract

Model-based  diagnosis  can  be  formulated  as  the 
combinatorial  optimization  problem  of  finding  an 
assignment  of  behavior  modes  to  all  the  compo¬ 
nents  in  a  system  such  that  it  is  not  only  consis¬ 
tent  with  the  system  description  and  observations, 
but  also  maximizes  the  prior  probability  associated 
with  it.  Because  the  general  case  of  this  problem 
is  exponential  in  the  number  of  components,  we 
try  to  leverage  the  structure  of  the  physical  sys¬ 
tem  under  consideration.  Traditional  dynamic  pro¬ 
gramming  techniques  based  on  the  underlying  con¬ 
straint  network  (like  heuristics  derived  from  maxi¬ 
mum  cardinality  ordering)  do  not  necessarily  sup¬ 
plement  or  do  better  than  algorithms  based  on  using 
truth  maintenance  systems  (like  conflict-directed 
best  first  search). 

In this paper, we compare the two approaches and
examine how we can incorporate the dynamic programming
paradigm into TMS-based algorithms to
achieve the best of both worlds. We describe
an  algorithm  called  hierarchical  conflict-directed 
best  first  search  (HCBFS)  to  solve  a  large  diag¬ 
nosis  problem  by  heuristically  decomposing  it  into 
smaller  sub-problems.  We  also  delve  into  some  of 
the  implications  of  HCBFS  with  respect  to  (1)  pre¬ 
compiling  the  system  description  to  a  form  that  can 
amortize  the  cost  of  a  diagnosis  call  and  (2)  facili¬ 
tating  other  hybrid  techniques  for  diagnosis. 

1  Introduction 

Diagnosis  is  an  important  component  of  autonomy  for  any 
intelligent  agent.  Often,  an  intelligent  agent  plans  a  set  of  ac¬ 
tions  to  achieve  certain  goals.  Because  some  conditions  may 
be  unforeseen,  it  is  important  for  it  to  be  able  to  reconfigure 
its  plan  depending  upon  the  state  in  which  it  is.  This  mode 
identification  problem  is  essentially  a  problem  of  diagnosis. 
In its simplest form, the problem of diagnosis is to find a
suitable assignment of the modes in which each component of
a system is behaving, given some observations made on
it. It is possible to handle the case of a dynamic system by
treating the transition variables as components (in one sense)
[Kurien and Nayak, 2000].


Definition (Diagnosis System): A diagnosis system is a triple
(SD, COMPS, OBS) such that:

1. SD is a system description expressed in one of several
forms: constraint languages like propositional logic,
probabilistic models like Bayesian networks, etc. SD specifies
both component behavior information ($SD_b$) and component
structure information, i.e. the topology of the system
($SD_t$).

2. COMPS is a finite set of components of the system. A
component $comp_i$ ($1 \le i \le |COMPS|$) can behave in one
of several modes drawn from a finite set ($M_i$). If these modes are
not specified explicitly, then we assume two modes: failed
($AB(comp_i)$) and normal ($\neg AB(comp_i)$).

3. OBS is a set of observations expressed as variable values.

The task of diagnosis is to "identify" the modes in which
individual components are behaving given the system
description (SD) and the observations (OBS).

Definition (Candidate): Given a set of integers
$i_1 \cdots i_{|COMPS|}$ (such that for $1 \le j \le |COMPS|$,
$1 \le i_j \le |M_j|$), a candidate $Cand(i_1 \cdots i_{|COMPS|})$ is
defined as
$Cand(i_1 \cdots i_{|COMPS|}) = \left(\bigcup_{k=1}^{|COMPS|} (comp_k = M_k(i_k))\right)$.

Here, $M_u(v)$ denotes the $v$th element in the set $M_u$ (assumed
to be indexed in some way).

2  Diagnosis  as  Combinatorial  Optimization 

Consider diagnosing a system consisting of three bulbs
$B_1$, $B_2$ and $B_3$ connected in parallel to the same voltage
source V under the observations $off(B_1)$, $off(B_2)$
and $on(B_3)$. $AB(V) \land AB(B_3)$ is a diagnosis under the
consistency-based formalization of diagnosis [de Kleer et al.,
1992] if we had constraints only of the form
$\neg AB(B_3) \land \neg AB(V) \rightarrow B_3 = on$. Intuitively however, it does not
seem reasonable because $B_3$ cannot be on without V working
properly. One way to get around this is to include fault
models in the system [Struss and Dressler, 1989]. These are
constraints that explicitly describe the behavior of a component
when it is not in its nominal mode (the most expected mode
of behavior of a component). Such a constraint in this example
would be $AB(B_3) \rightarrow off(B_3)$. Diagnosis can become
indiscriminate without fault models. It is also easy to see
that the consistency-based approach can exploit fault models
(when they are specified) to produce more intuitive diagnoses
(like only $B_1$ and $B_2$ being abnormal).
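The effect of fault models on the three-bulb example can be checked by brute force. The sketch below assumes the weak constraint and the fault model are stated for all three bulbs (the text gives them for $B_3$ only), and the names `consistent` and `diagnoses` are hypothetical.

```python
from itertools import product

COMPS = ["V", "B1", "B2", "B3"]
OBS = {"B1": "off", "B2": "off", "B3": "on"}


def consistent(ab, fault_models):
    """Check an AB assignment against the observations.

    Weak model:  not AB(Bi) and not AB(V)  ->  Bi = on
    Fault model: AB(Bi) -> Bi = off   (assumed here for every bulb)
    """
    for bulb, state in OBS.items():
        if not ab[bulb] and not ab["V"] and state != "on":
            return False
        if fault_models and ab[bulb] and state != "off":
            return False
    return True


def diagnoses(fault_models):
    """All consistent sets of abnormal components (not only minimal ones)."""
    result = []
    for values in product([False, True], repeat=len(COMPS)):
        ab = dict(zip(COMPS, values))
        if consistent(ab, fault_models):
            result.append(frozenset(c for c in COMPS if ab[c]))
    return result


weak = diagnoses(fault_models=False)
strong = diagnoses(fault_models=True)
print(frozenset({"V", "B3"}) in weak, frozenset({"V", "B3"}) in strong)
```

Without fault models, {V, B3} is accepted even though $B_3$ is observed on; with them it is rejected, while {B1, B2} remains a diagnosis.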

The  technique  of  using  fault  models  however  is  associated 
with  the  problem  of  being  too  restrictive.  It  may  not  model 
the case of some strange source of power making $B_3$ on, etc.
The  way  out  of  this  is  to  allow  for  many  modes  of  behavior 
for  the  components  of  the  system.  Each  component  has  a  set 
of  modes  with  associated  models  —  normal  modes  and  fault 
modes.  Each  component  has  the  unknown  fault  mode  with 
the  empty  model.  The  unknown  mode  tries  to  capture  the 
modeling  incompleteness  assumption  (obscure  modes  that  we 
cannot  model  in  the  system).  Also,  each  mode  has  an  associ¬ 
ated  probability  that  is  the  prior  probability  of  the  component 
behaving  in  that  mode.  Diagnosis  can  now  be  cast  as  a  com¬ 
binatorial  optimization  problem  of  assigning  modes  of  behav¬ 
ior  to  each  component  such  that  it  is  not  only  consistent  with 
SD  U  OBS,  but  also  maximizes  the  product  of  the  prior  prob¬ 
abilities  associated  with  those  modes  [de  Kleer  and  Williams, 
1989]. Note that the combinatorial optimization formulation
of  diagnosis  assumes  independence  of  the  behavior  modes  of 
components. 

Definition (Combinatorial Optimization Characterization): A
candidate $H = Cand(i_1 \cdots i_{|COMPS|})$ is a diagnosis if
and only if $SD \cup H \cup OBS$ is satisfiable and
$P(H) = \prod_{k=1}^{|COMPS|} P(comp_k = M_k(i_k))$ is maximized.

There  are  many  other  characterizations  of  diagnoses  based 
on  the  notions  of  abduction,  Bayesian  model  selection,  model 
counting  [Kumar,  2002]  etc.  These  characterizations  (includ¬ 
ing  combinatorial  optimization)  are  mostly  for  choosing  the 
most  likely  diagnosis  and  do  not  incorporate  any  notion  of 
refinement  [Lucas,  1997].  The  combinatorial  optimization 
formulation  to  return  the  most  likely  diagnosis  is  however 
justified, practical and suited for a variety of real-life
applications [Kurien and Nayak, 2000]. It also benefits from the
availability of computationally efficient algorithms to solve
combinatorial optimization problems [Williams and Nayak,
1996].

3  Computational  Methods 

Definition (Combinatorial Optimization Problem): A
combinatorial optimization problem is a tuple (V, f, c) where (1)
V is a set of discrete variables with finite domains. (2) An
assignment maps each v in V to a value in v's domain. (3) f
is a function that decides feasibility of assignments. (4) c is a
function that returns the cost of an assignment. (5) We want
to minimize c(V) such that f(V) holds.

In the context of diagnosis, the following correspondences
hold: (1) V = COMPS. (2) Domains correspond to modes
of behavior of components. (3) An assignment is a candidate.
(4) c is a simple cost model assuming independence
in behavior modes of components:
$c(comp_i = M_i(v)) = \log \frac{P(comp_i = M_i(v^*))}{P(comp_i = M_i(v))}$.
Here, $M_i(v^*)$ is the nominal mode of
behavior of $comp_i$; $P(comp_i = M_i(v^*)) \ge P(comp_i = M_i(v))$
for any $v \ne v^*$.
$c(Cand(i_1 \cdots i_{|COMPS|})) = \sum_{k=1}^{|COMPS|} c(comp_k = M_k(i_k))$.
(5) f is the satisfiability
of $SD \cup Cand(i_1 \cdots i_{|COMPS|}) \cup OBS$.
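A small numeric sketch shows why minimizing an additive log-cost is the same as maximizing the product of priors. It assumes the per-mode cost is the log-ratio of the nominal prior to the mode prior, which is one reading of the garbled formula in the source; the prior values and function names are hypothetical.

```python
import math

# Hypothetical mode priors for two components (illustration only).
PRIORS = {
    "V": {"ok": 0.999, "failed": 0.001},
    "B1": {"ok": 0.99, "failed": 0.01},
}


def mode_cost(priors, mode):
    """Assumed cost: log of (prior of nominal mode / prior of this mode)."""
    nominal = max(priors.values())
    return math.log(nominal / priors[mode])


def candidate_cost(candidate, priors=PRIORS):
    """Additive cost of a full mode assignment (independence assumption)."""
    return sum(mode_cost(priors[c], m) for c, m in candidate.items())


def candidate_prob(candidate, priors=PRIORS):
    """Product of the priors of the assigned modes."""
    return math.prod(priors[c][m] for c, m in candidate.items())


all_ok = {"V": "ok", "B1": "ok"}
bulb_fault = {"V": "ok", "B1": "failed"}
source_fault = {"V": "failed", "B1": "ok"}
for cand in (all_ok, bulb_fault, source_fault):
    print(cand, candidate_cost(cand), candidate_prob(cand))
```

Under this cost the all-nominal candidate costs exactly zero, and candidates are ordered by cost in the reverse of their order by probability.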

A  brute-force  method  of  solving  such  a  problem  is  to  use 
a  simple  best  first  search  (BFS)  which  is  clearly  exponen¬ 


tial  in  the  number  of  components.  It  can  however,  be  poten¬ 
tially  improved  by  leveraging  the  structure  of  the  system.  One 
popular  method  of  leveraging  structure  using  the  paradigm 
of  dynamic  programming  is  to  use  heuristics  derived  from  a 
maximum  cardinality  ordering  (m-ordering)  [Tarjan  and  Yan- 
nakakis,  1984]  over  the  constraint  network  relating  the  vari¬ 
ables  of  the  system.  Such  techniques  have  been  used  in  a  va¬ 
riety  of  domains  —  Bayesian  network  reasoning,  constraint 
satisfaction  problems  [Dechter,  1992]  etc.  A  constraint  net¬ 
work  on  the  variables  of  the  system  is  defined  by  having 
the  variables  represent  nodes  and  constraints  in  SD  repre¬ 
sent  hyper-edges.  Any  kind  of  optimization  or  satisfaction 
defined  over  the  variables  can  be  done  in  time  exponential  in 
the induced width of the graph [Dechter, 1992]. Although the
induced  width  itself  cannot  be  found  constructively  in  poly¬ 
nomial  time,  heuristics  derived  from  m-ordering  perform  rea¬ 
sonably  well  in  practice.  Throughout  the  rest  of  this  paper, 
we  will  refer  to  all  such  heuristics  as  naive  m-ordering  (naive 
because  they  do  not  supplement  the  power  of  TMS-based  al¬ 
gorithms). 

These  heuristics  however,  may  not  be  directly  beneficial 
or applicable when the number of components is somewhat
smaller than the total number of variables in the system (which
is  usually  the  case).  The  induced  width  of  the  constraint  net¬ 
work  relating  all  the  variables  in  a  physical  system  can  easily 
be  much  more  than  the  number  of  components.  A  further  dis¬ 
advantage  of  such  approaches  is  that  often  the  relationships 
between  variables  are  too  complex  and  consistency  checks 
may  involve  some  kind  of  a  “simulation”.  Since  dynamic 
programming  techniques  based  on  these  heuristics  maintain 
and  build  partial  assignments,  they  are  very  likely  to  be  costly 
processes. Furthermore, in many cases, the number of faulty
components is usually far smaller than the total number of
components, and these techniques do not exploit this significantly
towards computational gains.

One approach that addresses these problems somewhat
indirectly is conflict-directed best first search (CBFS)
[Williams and Nayak, 1996]. It is based on the idea of
examining hypotheses in decreasing order of their prior
probabilities and using a truth maintenance system (TMS) to
catch minimal conflicts and focus the search. QCBFS
[Kumar, 2001] is an extension of CBFS that leverages qualitative
knowledge present in the system. Because hypotheses are
examined in order of their probabilities, diagnoses that entail a
nominal behavior for all but a few components are caught as
soon as possible (unlike in the naive m-ordering case).

A  TMS  incorporates  and  uses  the  following  properties:  (1) 
If  a  partial  assignment  to  the  mode  behaviors  of  a  subset  of 
the  components  is  inconsistent,  then  any  other  assignment 
that  contains  this  subset  unchanged  is  also  inconsistent.  (2) 
Smaller  conflicts  result  in  more  pruning  of  the  search  space 
and  therefore,  whenever  an  assignment  A  is  infeasible,  a  min¬ 
imal  infeasible  subset  of  A  is  returned  (using  dependency 
tracking).  (3)  Since  the  hypotheses  that  we  examine  differ 
only  incrementally  from  one  another  in  the  assignments  for 
behavior  modes  of  components,  feasibility  checks  are  made 
more  efficient  (like  in  ITMS  [Nayak  and  Williams,  1997]). 
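Properties (1) and (2) can be illustrated with a simplified best-first enumeration that records conflicts and skips the consistency check for any candidate containing one. This is a sketch under stated assumptions, not the CBFS algorithm of [Williams and Nayak, 1996]: real CBFS also uses conflicts to generate successors, and the `check` callback, the priors and the bulb constraints below are hypothetical.

```python
import heapq
import math


def best_first_with_conflicts(priors, check):
    """Examine candidates in decreasing prior probability.

    priors: {component: {mode: probability}}
    check(candidate) -> None if consistent, else a conflict given as a
                        partial assignment {component: mode}.
    Returns (most probable consistent candidate or None, number of checks).
    """
    comps = sorted(priors)
    ranked = {c: sorted(priors[c], key=priors[c].get, reverse=True) for c in comps}

    def cost(idx):
        return -sum(math.log(priors[c][ranked[c][i]]) for c, i in zip(comps, idx))

    start = (0,) * len(comps)
    heap, seen, conflicts, checks = [(cost(start), start)], {start}, [], 0
    while heap:
        _, idx = heapq.heappop(heap)
        cand = {c: ranked[c][i] for c, i in zip(comps, idx)}
        # Property (1): a candidate containing a known conflict is inconsistent.
        covered = any(all(cand[c] == m for c, m in k.items()) for k in conflicts)
        if not covered:
            checks += 1
            conflict = check(cand)
            if conflict is None:
                return cand, checks
            conflicts.append(conflict)
        for pos, c in enumerate(comps):      # successors: worsen one component
            if idx[pos] + 1 < len(ranked[c]):
                nxt = idx[:pos] + (idx[pos] + 1,) + idx[pos + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (cost(nxt), nxt))
    return None, checks


PRIORS = {
    "V": {"ok": 0.99999, "ab": 0.00001},
    "B1": {"ok": 0.99, "ab": 0.01},
    "B2": {"ok": 0.99, "ab": 0.01},
    "B3": {"ok": 0.99, "ab": 0.01},
}


def check_bulbs(cand):
    """Three-bulb example with B1, B2 observed off and B3 observed on."""
    for bulb in ("B1", "B2"):
        if cand["V"] == "ok" and cand[bulb] == "ok":
            return {"V": "ok", bulb: "ok"}   # small conflict, property (2)
    if cand["B3"] == "ab":
        return {"B3": "ab"}                  # fault model: AB(B3) -> off(B3)
    return None


best, n_checks = best_first_with_conflicts(PRIORS, check_bulbs)
print(best, n_checks)
```

With these priors the search returns the candidate in which only B1 and B2 are abnormal, and the recorded conflicts let it skip consistency checks on several single-fault candidates.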
Figure 1: (a) Shows the worst-case scenario for m-ordering.
(b)  Shows  the  worst-case  scenario  for  CBFS. 

3.1  Comparison  of  naive  m-ordering  and  CBFS 

While naive m-ordering exploits the structure of the
underlying constraint network, it does not exploit the fact
that we are interested in an assignment only to the components
of the system (and not to the intermediate variables). This
becomes a liability especially when consistency checks involve
“simulation” and are therefore costly. It performs badly when a
“small” number of components are “tightly” connected.
Figure 1(a) illustrates the bad behavior of m-ordering. There
are 4 components that can behave in different modes
(C1, C2, C3 and C4). F1, F2 and F3 are not modeled as
components but are complex mappings (involving simulation)
from their inputs to outputs. The number of parents of
C4 is 6, and the combinatorial optimization problem
is exponential in this quantity [Darwiche, 1998]. A TMS-based
algorithm, however, requires only a search space
exponential in the number of components (= 4). This can be
verified by noting that once a mode is assumed for each
component (as in a TMS-based algorithm), verifying that the
current set of inputs leads to the observations is not
exponential but only polynomial in the size of the graph. This is
because every component maps its inputs to a unique output, and
we just need to follow the inputs through all the
transformations defined by the components to verify whether
there is a match with the observations. In the case of naive
m-ordering, however, combinatorial optimization requires us to
compute and store, against all values of the communication
variables around a family (also called a partition), the most
likely modes of behavior of the components in it. This makes it
exponential in the induced width of the graph. It is also easy to
see (as claimed earlier) that when the diagnosis is quite close
to the nominal behaviors of the components, there is no obvious
way of exploiting this with m-ordering.

Figure 2: (A) The physical setting. (B) The graph
representation. (C) The constraint network. (D) The T-Graph.

CBFS, on the other hand, exploits the fact that we are
interested in an assignment only for the components of the system,
but does not exploit the structure of the physical setting
efficiently. The only indirect way in which the structure comes
into play is in the TMS implementation of / to catch
minimal conflicts. The problem with CBFS is largely due to
the fact that all inconsistencies are traced back to the
components. This makes CBFS perform sub-optimally when
components are “loosely” connected. Figure 1(b) illustrates the
bad behavior of CBFS. An observation of O = 1 when C7 is
an XOR gate entails the conflicts {T1 = 1, T2 = 1} and
{T1 = 0, T2 = 0}. Note that T1 = 0, T2 = 0, T1 = 1 and
T2 = 1 are not conflicts by themselves. If all inconsistencies
are traced back to the components C1-C6, however, the
search space over component behavior modes is never pruned
by a minimal conflict of size smaller than 6. If, on the other
hand, we split the problem into two (by treating the cases
{T1 = 1, T2 = 0} and {T1 = 0, T2 = 1} separately), the
search space can be reduced to being exponential in 4
variables (rather than 6).
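
The XOR observation above can be checked mechanically. The following is a brute-force sketch under the assumption that T1 and T2 are binary; the helper name `minimal_conflicts` is hypothetical and the enumeration replaces the dependency tracking a TMS would perform.

```python
from itertools import combinations, product

def minimal_conflicts(variables, domain, consistent):
    """Return the minimal partial assignments that cannot be extended to a
    full assignment accepted by `consistent`."""
    def extendable(partial):
        free = [v for v in variables if v not in partial]
        return any(
            consistent({**partial, **dict(zip(free, values))})
            for values in product(domain, repeat=len(free))
        )

    conflicts = []
    # Visit partial assignments by increasing size so that any conflict
    # found is minimal unless it contains an earlier one.
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            for values in product(domain, repeat=size):
                partial = dict(zip(subset, values))
                if extendable(partial):
                    continue
                if any(c.items() <= partial.items() for c in conflicts):
                    continue
                conflicts.append(partial)
    return conflicts

# C7 is an XOR gate with inputs T1, T2 and observed output O = 1.
xor_conflicts = minimal_conflicts(
    ["T1", "T2"], [0, 1], lambda a: (a["T1"] ^ a["T2"]) == 1
)
```

No single-variable assignment is reported, which matches the remark that T1 = 0, T2 = 0, T1 = 1 and T2 = 1 are not conflicts by themselves.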

4  Hierarchical  Conflict-Directed  Best  First 
Search  (HCBFS) 

Before  we  describe  HCBFS  as  an  algorithm  that  can  combine 
the  best  of  both  the  above  approaches,  we  define  the  follow¬ 
ing  notions  related  to  the  structure  of  a  physical  setting. 

Definition (Structural Parameter Set): The structural
parameter set S of a physical system is the 4-tuple
S = (COMPS, I, O, T). Here, I is the set of external inputs,
O is the set of output variables under observation, and T is
the set of intermediate variables in the system which are not
under observation.

Definition (Graph Representation): The graph representation
of a physical system with structural parameter set S and
a topology characterized by SDt is a graph with nodes
corresponding to the elements of S and undirected edges
corresponding to the physical connections inferred from SDt.

Definition (*-node): A node in the graph representation of a
physical system is a c-node, i-node, o-node or t-node when
it corresponds respectively to a component, an input variable,
an output variable or an intermediate variable.

Definition  (T-Graph):  The  T-Graph  of  a  physical  system  with 
structural  parameter  set  S  and  topology  SDt  is  a  graph  built 
out  of  removing  the  c-nodes  from  its  graph  representation  and 
directly  connecting  the  inputs  to  their  outputs  (in  that  direc¬ 
tion). 

Figure  2  illustrates  the  above  definitions  for  a  simple  physi¬ 
cal  setting.  Note  that  the  graph  representation  is  not  the  same 
as  the  constraint  network  specified  by  SD.  While  the  con¬ 
straint  network  is  built  on  the  variables  of  the  system  (ex¬ 
cluding  components)  using  SD ,  the  graph  representation  is 
built  only  out  of  SDt  (and  includes  the  components).  The 
T-Graph  represents  the  causal  relationships  among  the  vari¬ 
ables  (excluding  the  components)  and  it  can  be  observed  that 
the  constraint  network  is  equivalent  to  the  T-Graph  moralized 
by making a clique out of all the parents of any node
[Dechter, 1992].
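
The construction of the T-Graph and its moralization can be sketched as follows. The two-component system used here is an assumed toy example (it is not the system of Figure 2), and the function names `t_graph` and `moralize` are hypothetical.

```python
from itertools import combinations

def t_graph(components):
    """components: dict name -> (list of input variables, output variable).
    Drop the c-nodes and connect every input of a component directly to
    that component's output, as a directed edge (input, output)."""
    return {(i, out) for inputs, out in components.values() for i in inputs}

def moralize(edges):
    """Return undirected edges: the original ones plus a clique over the
    parents of every node."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    undirected = {frozenset(e) for e in edges}
    for ps in parents.values():
        undirected |= {frozenset(pair) for pair in combinations(sorted(ps), 2)}
    return undirected

# Assumed toy system: C1 maps (I1, I2) to T1; C2 maps (T1, I3) to O1.
components = {"C1": (["I1", "I2"], "T1"), "C2": (["T1", "I3"], "O1")}
edges = t_graph(components)
moral = moralize(edges)
```

Here moralization adds the edges I1-I2 and T1-I3 to the four causal edges.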

Notation: Let $M(i)$ be the set of modes in which component
$comp_i$ can behave. Let $c_i$ be the cardinality of this set.
Let $T(i)$ be the set of values an intermediate variable
$t\text{-}node_i$ can take. Let $t_i$ be the cardinality of
this set.

Definition (c-size): The c-size of a sub-graph G is the product
of the number of modes in which each component it contains
can behave: $\prod_{i \in COMPS(G)} c_i$.

Definition (t-partition): A t-partition of a graph
representation is any collection of vertex-induced sub-graphs
$S_1, \ldots, S_k$ such that for all $i, j$ with
$1 \le i, j \le k$ and $i \ne j$, $S_i \cap S_j \subseteq T$.

Definition (t-size): The t-size of a sub-graph in a t-partition
of the original graph is the product of the number of
different values that each of the t-nodes it shares with other
sub-graphs can take. In other words, suppose $S_1, \ldots, S_k$
form a t-partition of the original graph. Denote the t-nodes in
each of these sub-graphs by $ST_1, \ldots, ST_k$. The t-size of
$S_i$ is given by
$\prod_{j \in ST_i} \left( t_j \mid \exists h,\ 1 \le h \le k,\ h \ne i,\ j \in ST_h \right)$.

Definition  (ct-size):  The  ct-size  of  a  graph  is  the  product  of 
its  c-size  and  t-size. 
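
The three size measures can be computed directly from their definitions. The sketch below uses an assumed two-partition example; the function names and the dictionaries `modes` and `values` (holding the cardinalities $c_i$ and $t_i$) are hypothetical.

```python
from math import prod

def c_size(components, modes):
    """Product of the number of modes of each component in the sub-graph."""
    return prod(modes[c] for c in components)

def t_size(index, partition_tnodes, values):
    """Product of the domain sizes of the t-nodes that sub-graph `index`
    shares with at least one other sub-graph of the t-partition."""
    shared = {
        t for t in partition_tnodes[index]
        if any(t in other
               for h, other in enumerate(partition_tnodes) if h != index)
    }
    return prod(values[t] for t in shared)

def ct_size(index, partition_comps, partition_tnodes, modes, values):
    return (c_size(partition_comps[index], modes)
            * t_size(index, partition_tnodes, values))

# Assumed example: two sub-graphs sharing only the t-node T2.
modes = {"C1": 3, "C2": 2, "C3": 4}
values = {"T1": 2, "T2": 5, "T3": 2}
partition_comps = [{"C1", "C2"}, {"C3"}]
partition_tnodes = [{"T1", "T2"}, {"T2", "T3"}]

sizes = [ct_size(i, partition_comps, partition_tnodes, modes, values)
         for i in range(2)]
```

Only T2 is shared, so it is the only t-node that contributes to either t-size.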

Given the graph representation of a physical system,
its c-size characterizes the size of the search space for
CBFS. The general idea behind HCBFS is to reduce the
effective search space of CBFS using dynamic programming.
Suppose we were able to divide the system into two
subsystems that had components
$comp_{i_1}, \ldots, comp_{i_{n_1}}$ and
$comp_{j_1}, \ldots, comp_{j_{n_2}}$ such that
$n_1 + n_2 = |COMPS|$. Now,
the search space for each of these two individual partitions
(for CBFS) becomes its respective c-size. Calling them
$C_1$ and $C_2$ respectively, we have $C_1 \cdot C_2 = C$
($C$ is the c-size of the original graph). Of course, the
search cannot simply be done in each of them independently
because of the common variables they share. However, we can
apply the idea of dynamic programming to solve each of these
partitions for all values of the variables they share and then
“join” the two results. If we allow the common variables to be
only among the t-nodes, then the size of the search space becomes
$C_1 T + C_2 T + T^2$ ($T$ is the t-size of the common t-nodes).
$C_1 T + C_2 T$ accounts for solving the sub-problems for all
values of the communication variables, and $T^2$ accounts for
“joining” them. It should be noted, however, that if
consistency checks involve “simulation”, then the $T^2$ term
tends to be negligible (because search over the join-space does
not involve simulation). Generalizing the above idea of dynamic
programming, it is also possible to characterize n-way splits
which partition the original graph into n partitions, each of
which shares communication variables with a subset of the
others.

Definition (Splitting Condition): The splitting condition
holds for a t-partition in a graph G if the sum of the ct-sizes of
the partitions and the join-size is strictly less than the c-size
of G.
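
For a 2-way split, the splitting condition amounts to a single comparison. The sketch below assumes the cost expression $C_1 T + C_2 T + T^2$ from the preceding discussion, with the join-size taken to be $T^2$; the function names and the numbers are illustrative assumptions.

```python
def two_way_cost(c1, c2, t):
    """ct-sizes of the two partitions plus the join-size T^2."""
    return c1 * t + c2 * t + t * t

def splitting_condition(c1, c2, t):
    """True if splitting is strictly cheaper than searching the whole graph."""
    return two_way_cost(c1, c2, t) < c1 * c2

# Two sub-systems of three 3-mode components each, sharing one binary t-node.
worth_it = splitting_condition(27, 27, 2)
# Two tiny sub-systems that communicate through a 4-valued t-node.
not_worth_it = splitting_condition(2, 2, 4)
```

In the first case the cost drops from 729 to 112; in the second the split would cost 32 against an original search space of 4, so the condition fails.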

To obtain maximum computational benefits, we have to
find a t-partition that minimizes the sum of the ct-sizes of
the resulting partitions and the join-size. This general n-way
split is NP-hard to find (easy to prove from the fact that
finding the induced width is NP-hard). However, HCBFS
employs a heuristic to decompose a large diagnosis problem
into sub-problems based on the topological structure of the
system. It runs in polynomial time and is always assured of
yielding computational benefits (albeit in sub-optimal
amounts). The idea is to examine only a polynomial number of
2-way splits and choose the greediest one if it satisfies the
splitting condition. Such a splitting process is performed
recursively until there is no more apparent scope for
computational benefits. Interestingly enough, the candidate
t-partitions that are examined are themselves derived using
the m-ordering heuristics. Figure 3 presents the HCBFS
algorithm, and Figure 4 illustrates its working on a small
example.

ALGORITHM HCBFS (Graph G = (V, E))
    T = T-Graph of G
    T' = Partition-Tree formed by m-ordering on moralized T
    E = Edges of T'
    GREEDYSPLIT (G, E)
END HCBFS

ALGORITHM GREEDYSPLIT (Graph G, Candidate-Splits B)
    bk = BEST-SPLIT (G, B)
    IF (SPLITTING-CONDITION (G, bk)) THEN
        (G1, G2) = PARTITION (G, bk)
        B1 = {bi | bi is on the same side of bk as G1}
        B2 = {bi | bi is on the same side of bk as G2}
        GREEDYSPLIT (G1, B1)
        GREEDYSPLIT (G2, B2)
    END IF
END GREEDYSPLIT

Figure 3: Hierarchical Conflict-Directed Best First Search

Figure 4: Illustrates the working of HCBFS to produce
sub-problems. Thicker edges denote greater communication
(t-size). P1, P2, P3 are the final partitions. The tables
indicate the solutions to diagnosis sub-problems for all values
of the surrounding communication variables.

The following properties hold for the HCBFS algorithm.

Property 1: The edges of T' maintain the running
intersection property [Dechter, 1992] and hence the t-nodes
constituting the communication variables on any edge form a
valid t-partition.

Property 2: Let the c-sizes of the final partitions be
$C_1, \ldots, C_k$. The c-size of the original graph is therefore
$\prod_{i=1}^{k} C_i$. The first time we partition G, it must
have been the case (because the splitting condition has to be
satisfied) that $\prod_{i=1}^{k} C_i > S \times T + T \times R$
(T is the size of the communication; S and R are the c-sizes of
the two resulting partitions, with
$S \times R = \prod_{i=1}^{k} C_i$). In later iterations, the
effective S and R only decrease recursively, and this
essentially means that HCBFS is always safe in producing
computational benefits.

Property  3:  The  total  number  of  splits  considered  is  clearly 
linear  since  they  correspond  to  the  edges  of  T' .  Although 
there  are  two  recursive  calls  to  GREEDYSPLIT,  the  can¬ 
didate  set  of  edges  that  enter  them  are  disjoint  and  hence 
GREEDYSPLIT  is  called  only  a  linear  number  of  times.  This 
proves that the running time of HCBFS is polynomial.

Property 4: Choosing certain edges in a tree as splits results
in  a  set  of  partitions  that  themselves  form  a  tree  with  respect 
to  the  split  edges  (as  illustrated  in  Figure  4).  Since  we  know 
that  optimization  in  a  tree  structured  network  is  exponential 
in  the  ct-size  of  the  largest  partition,  the  complexity  of  diag¬ 
nosis  using  HCBFS  is  exponential  in  this  parameter. 
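
The GREEDYSPLIT recursion of Figure 3 can be sketched on a partition tree as follows. This is a simplified illustration, not the authors' implementation: the function names, the dictionaries `c_size` and `t_size`, and the three-node tree are assumptions, and the cost of a cut uses only the c-sizes of the two sides and the t-size of the cut edge (it ignores communication over edges cut earlier).

```python
from math import prod

def side_of(start, cut, edges):
    """Tree nodes reachable from `start` without crossing the edge `cut`."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for a, b in edges:
            if (a, b) == cut:
                continue
            if a == node and b not in seen:
                seen.add(b)
                stack.append(b)
            elif b == node and a not in seen:
                seen.add(a)
                stack.append(a)
    return seen

def greedy_split(nodes, candidates, c_size, t_size):
    """Return the final partitions as a list of sets of tree nodes."""
    total = prod(c_size[n] for n in nodes)
    best = None
    for cut in candidates:  # BEST-SPLIT: the cheapest 2-way split
        left = side_of(cut[0], cut, candidates)
        right = nodes - left
        t = t_size[cut]
        cost = (prod(c_size[n] for n in left)
                + prod(c_size[n] for n in right)) * t + t * t
        if best is None or cost < best[0]:
            best = (cost, cut, left, right)
    # SPLITTING-CONDITION: stop unless the split is strictly cheaper.
    if best is None or best[0] >= total:
        return [nodes]
    _, cut, left, right = best
    # Remaining candidate edges are divided disjointly between the sides.
    b1 = [e for e in candidates if e != cut and e[0] in left]
    b2 = [e for e in candidates if e != cut and e[0] in right]
    return (greedy_split(left, b1, c_size, t_size)
            + greedy_split(right, b2, c_size, t_size))

# Assumed partition tree A - B - C.
c_size = {"A": 8, "B": 8, "C": 2}
t_size = {("A", "B"): 2, ("B", "C"): 4}
partitions = greedy_split({"A", "B", "C"}, list(t_size), c_size, t_size)
```

In this example the edge A-B is cut (cost 52 against a c-size of 128), while cutting B-C afterwards would cost 56 against 16 and is rejected.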

4.1  Analysis  and  Implications  of  HCBFS 

We  briefly  delve  into  the  computational  implications  of 
HCBFS.  HCBFS  facilitates  search  in  two  ways.  First,  it  re¬ 
duces  the  effective  search  space  by  using  the  dynamic  pro¬ 
gramming  paradigm.  Second,  it  propagates  “easiness”  in  con¬ 
straint  checking.  Constraint  checking  in  general  may  not  be 
computationally  straightforward  —  it  may  often  involve  sys¬ 
tem  “simulation”  of  some  kind  over  an  extended  period  of 
time. Note, however, that constraint checking over
the join space is a mere verification that two selected rows of
the partition tables have the same values for their
communication variables. By using HCBFS, the simulation-based
constraint checks are “pushed” to smaller parts of the system (the
partitions).  Even  for  consistency  checks  that  do  not  involve 
“simulation”,  implementing  a  TMS  for  each  small  partition 
is  more  effective  (in  terms  of  the  complexity  of  data  struc¬ 
tures  to  be  maintained)  than  one  large  TMS  for  the  system  as 
a  whole. 

HCBFS not only reduces the effective un-amortized search
complexity for a diagnosis call, but also reduces the
amortized complexity. The solutions to sub-problems occurring
for diagnosis calls made in the past can be stored and used for
future diagnosis calls when they need to solve the same
sub-problems. Eventually, when all sub-problems for all values
of communication variables have been solved at least once, a
diagnosis call can be answered by doing a search only over
the join-space of the partitions. This too (as argued before) is
computationally easier than “simulation”.
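
The reuse of sub-problem solutions across diagnosis calls can be sketched with a small cache in front of the (possibly simulation-based) sub-problem solver. Everything named here is hypothetical: `SubproblemCache`, `join`, the partitions P1 and P2, and the stored table entries are assumptions made for illustration, with a single shared communication variable.

```python
class SubproblemCache:
    """Stores solved sub-problems, keyed by partition and by the value of
    its communication variable."""

    def __init__(self, solve):
        self.solve = solve   # expensive solver, e.g. CBFS with simulation
        self.tables = {}
        self.calls = 0       # number of times the solver was really run

    def entry(self, partition, value):
        key = (partition, value)
        if key not in self.tables:
            self.calls += 1
            self.tables[key] = self.solve(partition, value)
        return self.tables[key]

def join(cache, p1, p2, values):
    """Search the join-space: pick the shared value whose two table rows
    have the highest combined probability."""
    best = None
    for value in values:
        modes1, prob1 = cache.entry(p1, value)
        modes2, prob2 = cache.entry(p2, value)
        if best is None or prob1 * prob2 > best[1]:
            best = ({**modes1, **modes2}, prob1 * prob2)
    return best

# Assumed sub-problem solutions: (most likely modes, probability).
STORED = {
    ("P1", 0): ({"C1": "ok"}, 0.9),
    ("P1", 1): ({"C1": "stuck"}, 0.1),
    ("P2", 0): ({"C2": "broken"}, 0.05),
    ("P2", 1): ({"C2": "ok"}, 0.95),
}
cache = SubproblemCache(lambda partition, value: STORED[(partition, value)])
first = join(cache, "P1", "P2", [0, 1])
second = join(cache, "P1", "P2", [0, 1])
```

The second call is answered entirely from the stored tables, so the solver runs only four times in total.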

The  dynamic  programming  idea  of  HCBFS  can  further  be 
used  to  pre-process  or  compile  the  system  description  to  fa¬ 
cilitate  diagnosis.  Consider  a  partition  of  the  graph  represen¬ 
tation  of  a  physical  setting.  The  idea  is  to  solve  the  diagnosis 
problem  for  this  partition  for  all  values  of  the  surrounding  in¬ 
termediate  variables  ( t-nodes )  and  store  the  results.  We  can 
then  treat  this  partition  as  a  single  physical  component  that 
can  take  any  value  (mode)  corresponding  to  a  combination 
of  the  values  for  each  of  its  surrounding  t-nodes.  The  as¬ 
sociated  probabilities  would  be  derived  from  the  results  for 
the corresponding diagnosis sub-problems. This kind of
pre-compilation of the system to treat partitions as components
provides computational benefits only if their t-size is smaller
than their c-size (which is often the case).

The  space  complexity  associated  with  HCBFS  has  two 
components.  One  is  the  size  of  the  tables  associated  with 
the  sub-problems.  This  is  referred  to  as  the  table-space  com¬ 
plexity.  It  is  easy  to  observe  that  the  table  space  complexity 
is  equal  to  the  sum  of  the  t-sizes  over  all  partitions.  Another 
component  of  the  space  requirement  is  the  actual  space  re¬ 
quired  for  the  diagnosis  algorithms  to  build  the  tables  and 
compose  them  to  answer  a  diagnosis  call.  This  space  require¬ 
ment  is  identical  to  the  running  time  complexities  associated 
with  solving  and  composing  sub-problems.  It  is  worth  not¬ 
ing  that  the  cost  of  implementing  dynamic  programming  in 
HCBFS  is  reflected  only  in  its  table-space  complexity. 

HCBFS  also  leads  to  what  are  called  hybrid  approaches. 
These  are  techniques  that  combine  conflict-based  and 
coverage-based  approaches  [Kumar,  to  appear]  to  solve  sub¬ 
problems  and  combine  their  solutions.  Coverage-based  algo¬ 
rithms  are  those  that  record  conflicts  and  cast  the  diagnosis 
problem  as  a  minimum  weight  hitting  set  problem  [Kurien 
and  Nayak,  2000].  Conflict-based  approaches  refer  to  the 
standard  TMS-based  algorithms  like  CBFS  and  QCBFS.  In 
general,  hybrid  approaches  do  the  following:  (1)  Employ 
the  hierarchical  partitioning  algorithm  to  reduce  the  effective 
search  space.  (2)  Employ  one  of  coverage-based  or  conflict- 
based  approaches  for  the  sub-problems  and  the  join  space. 
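
The coverage-based view can be illustrated with a brute-force minimum weight hitting set over recorded conflicts. This is a sketch only: the function name, the conflicts and the weights are assumptions, each conflict is taken to be a set of components of which at least one must be faulty, and a weight could be read as, for example, a negative log-probability of the fault.

```python
from itertools import combinations

def min_weight_hitting_set(conflicts, weight):
    """Return the cheapest set of components that intersects every
    conflict, together with its total weight (exhaustive search)."""
    universe = sorted(set().union(*conflicts))
    best, best_weight = None, float("inf")
    for size in range(len(universe) + 1):
        for subset in combinations(universe, size):
            chosen = set(subset)
            if all(chosen & conflict for conflict in conflicts):
                total = sum(weight[c] for c in chosen)
                if total < best_weight:
                    best, best_weight = chosen, total
    return best, best_weight

# Assumed conflicts and fault weights.
conflicts = [{"A", "B"}, {"B", "C"}, {"A", "C"}]
weight = {"A": 1, "B": 3, "C": 1}
diagnosis = min_weight_hitting_set(conflicts, weight)
```

No single component hits all three conflicts here, so the cheapest diagnosis uses two components.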

5  Comparison  with  Related  Work 

Related work on leveraging structure in the task of
diagnosis can be found in [Darwiche, 1998], [Autio and
Reiter, 1998] and [Provan, 2001], among others. In [Darwiche,
1998], negation normal forms (NNF) are used to represent the
consequence of SD ∪ OBS. Subsequently, minimal cardinality
diagnoses are extracted from them using a simple cost
propagation and pruning algorithm. For such a procedure to be
effective, it is important to ensure the decomposability of the
NNF. Decomposability is achieved by partitioning SD to perform
a case analysis on the shared atoms that do not appear among the
observations. The partitioning choices are inspired by trying
to produce a join-tree of the topological structure of the
system, much like the m-ordering heuristics. The complexity of
the algorithm is exponential in the size of the hyper-nodes of
the join tree and linear in the number of such hyper-nodes.

There  are  at  least  three  important  ways  in  which  this  ap¬ 
proach  differs  from  ours.  Firstly,  this  approach  does  not  rea¬ 
son  about  probabilities  but  rather  looks  for  minimal  diagnoses 
(minimizes  the  number  of  faulty  components).  Secondly  (and 
more  importantly),  it  tries  to  produce  diagnoses  (minimal)  by 
maintaining  at  each  stage,  a  representation  for  all  the  consis¬ 
tent  candidates.  The  optimization  phase  (of  producing  mini¬ 
mal  candidates)  occurs  as  a  separate  phase.  Usually,  we  are 
not  interested  in  all  consistent  diagnoses  and  trying  to  rep¬ 
resent  them  at  any  stage  when  there  could  potentially  be  an 
exponentially  large  number  of  them  can  be  a  bottleneck.  In 
our  approach,  the  optimization  and  satisfaction  phases  are  in¬ 
terleaved.  This  allows  us  to  produce  candidates  as  and  when 
we  want  them,  in  decreasing  order  of  their  optimization  val¬ 
ues,  and  to  prune  the  search  space  using  both  optimality-  and 
satisfiability-reasoning. Thirdly, if the number of
intermediate variables is large, achieving decomposability in the
NNF is exponential in the induced width of the moralized
T-Graph; but since we are interested only in the behavior
modes of components and not in those of intermediate variables,
the search space may be significantly reduced using our
approach when the components are “tightly” coupled.

In [Provan, 2001] the idea of hierarchical diagnosis has a
different meaning. It is based on the use of abstraction
operators to define an abstraction hierarchy of the model (a
lattice induced by a set of partitions of the system variables).
A group of components and intermediate variables at a
particular abstraction level are “merged” to form “abstract”
components at a higher level with appropriately defined inputs,
outputs and constraints relating them. A structural
abstraction sc of subcomponents $c_1, \ldots, c_k$ defines two
modes of behavior for sc, $AB(sc)$ and $\neg AB(sc)$, with the
constraint that
$\neg AB(c_1) \wedge \cdots \wedge \neg AB(c_k) \rightarrow \neg AB(sc)$.
Such an abstraction mechanism is useful only for isolating a
group of components all of which cannot be behaving in their
nominal modes (abstract models isolate diagnoses only at the
abstract level, but more efficiently). At each level of
abstraction we only define the nominal mode of behavior for the
abstract component. The only other implicit mode is the faulty
mode. This limits the scope of diagnosis even at the abstract
levels. Under a combinatorial optimization formulation of the
diagnosis problem, abstraction of $c_1, \ldots, c_k$ to sc only
defines what happens when all components $c_1, \ldots, c_k$ are
behaving in their most probable modes (nominal mode for sc). It
does not say anything about what probabilities are associated
with, or what happens in, any of the remaining exponentially
large number of non-nominal modes. This makes diagnosis not only
infeasible at more detailed levels, but also information-lossy
at abstract levels.

6  Conclusions 

In this paper, we employed the combinatorial optimization
characterization of the diagnosis problem. We compared
two different approaches that exploit different features of
the problem: (1) naive m-ordering exploits the structure of
the system by leveraging the causal dependencies among the
variables (T-Graph); (2) CBFS exploits the fact that the
output is uniquely determined for given inputs to a component
behaving in a known mode, and that we are interested only
in an assignment to the component modes of the system. We
observed that naive m-ordering performs poorly when there is
high interconnectedness among components and that CBFS
performs poorly when there is low coupling. We proposed a
computationally feasible algorithm called HCBFS (extending
CBFS) to achieve the best of both worlds. HCBFS uses
CBFS in tightly coupled parts of the system and m-ordering
to identify them. We showed that HCBFS has many important
implications for the complexity of diagnosis: it reduces the
un-amortized complexity of a diagnosis call, reduces the
amortized complexity of a diagnosis call by reusing computation
done for sub-problems arising in past diagnosis calls, allows
pre-compilation of the system description to facilitate
diagnosis, and enhances hybrid algorithms. Finally, we compared
and contrasted our work with somewhat related approaches.

Abstract. Consistency-based diagnosis is the most widely used
approach to model-based diagnosis within the Artificial Intelligence
community. It is usually carried out through an iterative cycle of
behavior prediction, conflict detection, and candidate generation and
refinement. Many approaches to consistency-based diagnosis have
relied on some kind of on-line dependency-recording mechanism for
conflict calculation. These techniques have had different problems,
especially when applied to dynamic systems. Recently, off-line
compilation of dependencies has been established as a suitable
alternative approach. In this work we compare one compilation
technique, based on the possible conflict concept, with results
obtained with the classical on-line dependency-recording engine, as
in GDE. Moreover, we compare possible conflicts with another
compilation technique coming from the FDI community, which is based
on analytical redundancy relations. Finally, we study the
relationship between possible conflicts, analytical redundancy
relations, and conflicts.

1  Introduction 

For more than thirty years, different techniques have been applied
to diagnose systems in multiple domains. Diagnosis has been carried
out through knowledge-based systems, case-based reasoning,
model-based reasoning, and so on. This work focuses on the
model-based approach to diagnosis. Moreover, we will only talk about
diagnosis of physical devices [18].

More specifically, consistency-based diagnosis is the most widely
used approach to model-based diagnosis within the Artificial
Intelligence community (usually known as DX). It is a research field
that has reported successful results in recent years [39, 7]. This
approach has proven its maturity, both in theory and in practice. On
the one hand, the diagnosis process and the diagnosis results have
been completely characterized from a logical point of view [32, 12],
thus facilitating further comparison. On the other hand,
consistency-based diagnosis has been successfully applied to a wide
variety of domains such as the automotive industry [3, 38],
bio-medicine [20], nuclear plants [24], and ecology [37].

In such a framework, GDE [13] is the best-known implementation
and the de facto paradigm. GDE organizes the diagnosis process as an
iterative cycle of behavior prediction, conflict detection,
and candidate generation and refinement. But conflict computation
is a non-trivial step, which has received a lot of attention from the
consistency-based diagnosis community. In GDE, the set of minimal
conflicts is computed by means of an ATMS [11], which records
on-line the set of correctness assumptions, or dependencies, used by
the inference engine. Note that dependency-recording
can be done forward (whenever new input data are introduced) or
backward (when a discrepancy is found, as in CAEN [2, 21],
DYNAMIS [6], or TRANSCEND [25]). Another important feature
of the GDE framework is that it calculates labels by propagating
values through constraints in every possible direction.

However, one problem related to on-line dependency-recording is
that the set of labels needs to be computed each time a new, different
value is introduced. Another problem was found in the combined use
of on-line dependency-recording together with qualitative models for
diagnosing dynamic systems [17, 14]. Mainly for these reasons,
several research groups have looked for alternatives to this
kind of on-line dependency-recording. On the one hand, state-based
diagnosis [36] has emerged as an alternative to simulation-based
diagnosis, just for qualitative models. On the other hand, topological
methods propose to explicitly use the structural description of the
system to be diagnosed. This information is implicitly stated in the
system description. Within this last approach, we distinguish
two major trends: those methods that use on-line
dependency-recording other than the ATMS (by exploring causal graphs
[2, 24], signed directed graphs [26], or other topological and
functional structures [5]), and those methods that perform off-line
dependency-recording.

These latter techniques are also known as compilation methods
within the DX community. The main idea supporting this approach is
that redundancy within the models can be found off-line. A similar
idea was used in the Control Engineering community (or FDI), where
Staroswiecki and Declerck proposed to use Analytical Redundancy
Relations (ARRs for short) for fault detection and localization [34].
Given this similarity, there is an ongoing interest from the DX and
the FDI communities in comparing their approaches.

Between the FDI and AI proposals, Lunze and Schiller [23] were
able to perform diagnosis using causal graphs associated with
over-constrained systems. These systems were obtained from the
logical formulae in the models of the system.

Within  the  DX  community  we  have  found  the  following  compila¬ 
tion  techniques: 

•  Darwiche  and  Provan  [10]  characterized  the  set  of  diagnoses  using 
the  consequence  concept  [9],  instead  of  using  the  conflict  concept. 
Analyzing  the  system  structure,  those  sub-systems  which  could 
lead  to  a  diagnosis  can  be  found  off-line. 

• Similar information is used by Steele and Leitch [35] to refine the
set of candidates, in an adaptive approach to diagnosis [4].

• In DOGS, Loiez and Taillibert [22] proposed to localize, off-line,
over-constrained sets of equations. They were looking for those
sub-systems capable of becoming conflicts. The work done is
conceptually equivalent to that in [34], as stated in [8].

• Frohlich and Nejdl [15] used structural information in two ways:
they analyzed the whole set of logical formulae in the model to find
subsets of formulae capable of generating diagnoses, and they used
these subsets to refine the whole set of diagnosis candidates.

• Pulido and Alonso [27, 28] proposed to organize consistency-based
diagnosis around the possible conflict concept. A possible
conflict is a sub-system in the system description which is capable
of becoming a conflict within the GDE framework.

In this work we revisit the compilation technique based on the
possible conflict concept [27, 28]. First, we summarize the
characterization of that concept, in order to compare possible
conflicts against real conflicts. Later on, we establish the
relationship between possible conflicts and ARRs. Finally, we revisit
the work by Cordier et al. [8] in order to compare conflicts and ARRs
from a computational point of view.

Due to space limitations we do not compare possible conflicts with
other compilation techniques from the DX community. Such a
comparison can be found in [28, 30].


2  The  possible  conflict  concept 

The main assumptions in this work are that there is no structural
fault, and that it is possible to know beforehand the number and
placement of available observations (sensors). An additional
assumption is that the model of the system can be expressed as a set
of constraints: quantitative or qualitative, linear or not, algebraic
or not.

In Reiter’s framework for model-based diagnosis [32] a minimal
conflict identifies a set of constraints containing enough redundancy
to perform diagnosis. In the simplest case, when constraints are
made up of equations, a minimal conflict would denote a strictly
over-determined system¹.

As mentioned in the previous section, the shared basis of compilation
techniques is that the set of analytically redundant sub-systems, which
can be used for diagnosis purposes, can be computed off-line. Moreover,
it has been proven that GDE provides all the existing minimal conflicts.
Since the set of possible conflicts is intended as a computational
alternative to on-line dependency recording for conflict computation, we
have imposed an additional requirement: the over-constrained sub-systems
should be the same as the set of minimal conflicts computed by GDE².

Finding analytical redundancy is a necessary but not a sufficient
condition for a system to be suitable for consistency-based diagnosis
purposes. The system must also be solvable using local propagation
alone³. To fulfill both requirements we have split the search process
into two phases. First, we look for over-determined systems. Second, we
check whether these systems can be solved using local propagation alone.
To do so, we just need abstractions of the model description. For the
sake of readability, we include below a summary of the definitions the
reader can find in [27, 28].


1 In an over-determined system the number of equations, $e$, is greater than the
number of unknowns, $u$: $e \geq u + 1$. In a strictly over-determined system,
$e = u + 1$.

2 For this reason, we always assume that we have the same model (system
description or SD in Reiter’s terminology) as GDE has.

3 Current consistency-based diagnosis systems do not impose that constraint
[19]. In [30] we extended the possible conflict concept to deal with such
(cyclical) configurations.


2.1  Searching  for  over-determined  systems 

We have represented the model in SD as a hyper-graph
$H_{sd} = \{V, R\}$, which is made up of:

• $V = \{v_1, v_2, \ldots, v_n\}$, the set of variables in the model. It is
made up of observed variables, $OBS$, and not observed or unknown
variables, $NOBS$: $V = OBS \cup NOBS$.

• $R = \{r_1, r_2, \ldots, r_m\}$, a family of subsets of $V$, where each
$r_k$ represents a constraint in the model and contains some model
variables, observed and not observed ones.

We have called Evaluation Chains the over-constrained sub-systems in
$H_{sd}$ (Appendix A gives definitions for the terminology on graphs and
hyper-graphs, cf. [16, 1]):

Evaluation chain: $H_{ec} \subseteq H_{sd}$ is a partial sub-hypergraph in
$H_{sd}$: $H_{ec} = \{V_{ec}, R_{ec}\}$, where $V_{ec} \subseteq V$,
$R_{ec} \subseteq R$, and $X_{ec} = V_{ec} \cap NOBS$ is the set of
unknowns in $V_{ec}$, and $H_{ec}$ verifies:

1. $H_{ec}$ is a connected hypergraph,

2. $V_{ec} \cap OBS \neq \emptyset$,

3. $\forall v_{no} \in X_{ec} \Rightarrow d_{H_{ec}}(v_{no}) \geq 2$,

4. let $G(H_{ec})$ be a bipartite graph made up of two kinds of nodes:
$x \in X_{ec}$ and $r_{i_{ec}} \in R_{ec}$, such that two nodes are linked
in $G(H_{ec})$ if and only if $x \in r_{i_{ec}}$. Then, $G(H_{ec})$ has a
matching with maximal cardinality $m' = |X_{ec}|$ and
$|R_{ec}| \geq m' + 1$.

Figure 1 shows a classical example in consistency-based diagnosis. In
order to distinguish components from constraints, we use capital letters
for components and small letters for the constraints in their models:
$m_i$ and $a_j$ denote the models of multipliers and adders, respectively.
Each model is made up of just one constraint; for instance,
$m_1 = \{A, C, X\}$. Whenever a model has more than one constraint, indices
are used to distinguish them. The related hyper-graph is

$H_{polybox} = \{\{A, B, C, D, E, F, G, X, Y, Z\}, \{m_1, m_2, m_3, a_1, a_2\}\}$


Figure 1. Classical polybox example in consistency-based diagnosis.
Observed values are in brackets. $\{X, Y, Z\}$ are non-observed values.
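The four conditions of an evaluation chain can be checked mechanically. The following sketch is not part of the original algorithms: it encodes the polybox hyper-graph as a dictionary and tests a candidate set of constraints. The names `R`, `is_connected`, `max_matching` and `is_evaluation_chain` are illustrative assumptions. So are the variable sets of the constraints other than $m_1$, which follow the usual polybox equations (X = A×C, Y = B×D, Z = C×E, F = X+Y, G = Y+Z).

```python
# Polybox hyper-graph: constraint name -> variables it involves (assumed layout).
R = {
    "m1": {"A", "C", "X"},
    "m2": {"B", "D", "Y"},
    "m3": {"C", "E", "Z"},
    "a1": {"X", "Y", "F"},
    "a2": {"Y", "Z", "G"},
}
OBS = {"A", "B", "C", "D", "E", "F", "G"}
NOBS = {"X", "Y", "Z"}


def is_connected(chain):
    """Condition 1: all constraints are linked through shared variables."""
    chain = list(chain)
    seen, stack = {chain[0]}, [chain[0]]
    while stack:
        r = stack.pop()
        for s in chain:
            if s not in seen and R[r] & R[s]:
                seen.add(s)
                stack.append(s)
    return len(seen) == len(chain)


def max_matching(chain, unknowns):
    """Size of a maximum matching between unknowns and constraints,
    computed with the classical augmenting-path (Kuhn) method."""
    match = {}  # constraint -> unknown currently matched to it

    def augment(x, visited):
        for r in chain:
            if x in R[r] and r not in visited:
                visited.add(r)
                if r not in match or augment(match[r], visited):
                    match[r] = x
                    return True
        return False

    return sum(augment(x, set()) for x in unknowns)


def is_evaluation_chain(chain):
    variables = set().union(*(R[r] for r in chain))
    unknowns = variables & NOBS
    # Condition 3: every unknown appears in at least two constraints.
    degree_ok = all(sum(x in R[r] for r in chain) >= 2 for x in unknowns)
    m = max_matching(sorted(chain), sorted(unknowns))
    return (is_connected(chain)
            and bool(variables & OBS)       # condition 2
            and degree_ok
            and m == len(unknowns)          # condition 4: every unknown matched
            and len(chain) >= m + 1)        # condition 4: one constraint to spare


print(is_evaluation_chain({"m1", "m2", "a1"}))   # three constraints, unknowns X and Y
print(is_evaluation_chain({"m1", "a1"}))         # Y appears only once
```

Under these assumptions the first call accepts the chain and the second rejects it, because Y would be involved in a single constraint.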

Since we are interested in minimal conflicts, only minimal evaluation
chains, MEC for short, are useful.

Minimal Evaluation Chain: $H_{ec}$ is a minimal evaluation chain if
there is no evaluation chain $H'_{ec} \subset H_{ec}$.

The set of minimal evaluation chains, SMEC, is built by the algorithms
build-every-mec(), build-mec(), and justify(), which perform depth-first
search in $H_{sd}$ using backtracking. All these algorithms can be found in
Appendix B. In the polybox example, these algorithms have found three
MECs:

$H_{ec_1} = \{\{A, B, C, D, F, X, Y\}, \{m_1, m_2, a_1\}\}$

$H_{ec_2} = \{\{B, C, D, E, G, Y, Z\}, \{m_2, m_3, a_2\}\}$

$H_{ec_3} = \{\{A, C, E, F, G, X, Y, Z\}, \{m_1, a_1, a_2, m_3\}\}$
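The search for minimal evaluation chains can be sketched in a few lines of Python. This is a minimal transcription of the pseudocode of Appendix B under stated assumptions, not the authors' implementation: constraints are selected in alphabetical order (the pseudocode leaves the selection open), the polybox constraints are encoded as in the usual example, and the names `CONSTRAINTS`, `unknowns`, `justify` and `build_every_mec` are illustrative.

```python
# Polybox constraints: name -> variables involved (assumed encoding).
CONSTRAINTS = {
    "m1": {"A", "C", "X"},
    "m2": {"B", "D", "Y"},
    "m3": {"C", "E", "Z"},
    "a1": {"X", "Y", "F"},
    "a2": {"Y", "Z", "G"},
}
NOBS = {"X", "Y", "Z"}


def unknowns(name):
    """Unknown (non-observed) variables of a constraint, R.nobs in Appendix B."""
    return CONSTRAINTS[name] & NOBS


def justify(smec, chain, to_justify, justified, available):
    """Depth-first search with backtracking; stores only minimal chains."""
    if not to_justify:
        if not any(m <= chain for m in smec):
            smec[:] = [m for m in smec if not chain <= m]  # erase supersets
            smec.append(chain)
        return
    v = min(to_justify)  # deterministic choice of the next unknown
    for r1 in sorted(r for r in available if v in unknowns(r)):
        justified2 = justified | {v}
        to_justify2 = (to_justify - {v}) | (unknowns(r1) - justified2)
        justify(smec, chain | {r1}, to_justify2, justified2, available - {r1})


def build_every_mec():
    smec = []
    available = frozenset(CONSTRAINTS)
    for r in sorted(CONSTRAINTS):
        available = available - {r}  # a starting constraint is never reused
        justify(smec, frozenset({r}), unknowns(r), frozenset(), available)
    return smec


for mec in build_every_mec():
    print(sorted(mec))
```

With this selection order the sketch returns the same three constraint sets as listed above for the polybox.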

2.2  Can  an  evaluation  chain  be  solved? 

A minimal conflict is a strictly over-determined system that we want to
solve using local propagation alone. However, the hyper-graph does not
have enough information about how each constraint can be solved. To
tackle this problem, we create an AND-OR graph for each minimal
evaluation chain. In such a graph, there are one or more AND-OR arcs for
each hyper-arc in the MEC. Each AND-OR arc represents one way the
hyper-arc could be solved. To solve a MEC, we should select one AND-OR
arc for each constraint. As a consequence, choosing different AND-OR
arcs from the AND-OR graph generates different ways of solving the MEC.
Moreover, the over-determined system can only be solved using local
propagation criteria. Each one of the different ways of solving a MEC is
called a Minimal Evaluation Model, or MEM.

For instance, each constraint ($m_i$ or $a_j$) used to model the polybox
system provides three different interpretations to the AND-OR graph:

$m_{i1}$: $v_{out} = v_{in1} \times v_{in2}$
$m_{i2}$: $v_{in1} = v_{out} / v_{in2}$, if $v_{in2} \neq 0$
$m_{i3}$: $v_{in2} = v_{out} / v_{in1}$, if $v_{in1} \neq 0$
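As a small illustration, the three interpretations of a multiplier can be written as three functions, one per AND-OR arc. This is only a sketch: the function names mirror the labels above, and returning `None` when the guard fails is an assumption made here for illustration.

```python
from typing import Optional


def m_i1(v_in1: float, v_in2: float) -> float:
    """Forward interpretation: compute the output from both inputs."""
    return v_in1 * v_in2


def m_i2(v_out: float, v_in2: float) -> Optional[float]:
    """Inverse interpretation for the first input; not applicable if v_in2 == 0."""
    return None if v_in2 == 0 else v_out / v_in2


def m_i3(v_out: float, v_in1: float) -> Optional[float]:
    """Inverse interpretation for the second input; not applicable if v_in1 == 0."""
    return None if v_in1 == 0 else v_out / v_in1
```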

Interpretations for a constraint are usually obtained by applying the
invertibility criterion. Nevertheless, there are additional criteria.
Appendix D shows the constraints used to model a physical system made
up of tanks, pumps and valves. Constraints $tr1_3$, $t2_3$, $tr2_3$ are used
to compute the mass in a tank. For this kind of constraint just one
interpretation is allowed, since we have taken an integration approach:

$m_T(t) = \int m'_T(t - 1)\,dt + m_T(t - 1)$

This interpretation cannot be reversed. Hence, additional concepts are
necessary to define a Minimal Evaluation Model.

Given the relation between $r_{i_{ec}} \in R_{ec}$ and the set of AND-OR
arcs $r_{ik_{em}}$ derived from $r_{i_{ec}}$, we can state the following
proposition.

Proposition 1 Let $AOG(H_{ec}) = \{V_{em}, R_{em}\}$ be the AND-OR graph
obtained from $H_{ec} = \{V_{ec}, R_{ec}\}$ applying the local resolution
criterion, where:

• $V_{em} = V_{ec}$,

• $\forall r_{i_{ec}} \in R_{ec} \Rightarrow \exists r_{ik_{em}} \in R_{em}, k \geq 1$.

Then, $r_{i_{ec}} \in R_{ec}$ induces a partition in $R_{em}$.

Proof: Each $r_{i_{ec}} \in R_{ec}$ induces an equivalence class in $R_{em}$.
By definition, it induces a partition too.

Leaf node: $v_i$ is a leaf node in graph $H$ iff $\Gamma_i^{-1} = \emptyset$.

Discrepancy node: $v_i$ is a discrepancy node in graph $H$ iff

• $(d_H^-(v_i) \geq 2 \wedge v_i \in NOBS)$, or

• $(d_H^-(v_i) \geq 1 \wedge v_i \in OBS)$

That is, a leaf node has no predecessors, and a discrepancy node can be
found in two different ways: estimating an observed variable, or doing
a double estimation for an unknown variable.

Minimal Evaluation Model: A partial AND-OR graph,
$H_{mem} \subseteq AOG(H_{ec})$, where $H_{mem} = \{V_{mem}, R_{mem}\}$,
is a minimal evaluation model iff:

1. $R_{mem}$ is a minimal hitting-set for the partition induced by
$r_{i_{ec}} \in R_{ec}$ in $R_{em}$,

2. $(\forall v_i \mid v_i \in V_{mem}$ and $v_i$ is a leaf node$) \Rightarrow v_i \in OBS$,

3. $\exists!\, x_j \in V_{mem} \mid x_j$ is a discrepancy node,

4. if $x_j$ is a discrepancy node, then there exists a directed and
acyclic path in $H_{mem}$: $\{x_i, x_{i+1}, \ldots, x_{i+k}, x_j\}$ from
each node $x_i$ to $x_j$.

The algorithms used to calculate every MEM for each MEC,
build-every-mem() and build-mem(), are given in Appendix C. These
algorithms are exhaustive too, since they perform depth-first search
using backtracking. For instance, MEC $H_{ec_1}$ has a related AND-OR
graph:

$AOG(H_{ec_1}) = \{\{A, B, C, D, F, X, Y\}, \{m_{11}, m_{12}, m_{13}, m_{21}, m_{22}, m_{23}, a_{11}, a_{12}, a_{13}\}\}$

Given $H_{ec_1}$ and the set of available interpretations in
$AOG(H_{ec_1})$, algorithm build-mem() is able to find seven different
MEMs⁴:

MEMs                            Equivalent to evaluating the expression

$\{m_{11}, m_{21}, a_{11}\}$    $F_{obs} = F_{pred} = A \times C + B \times D$
$\{m_{11}, m_{21}, a_{12}\}$    $X_{pred1} = A \times C = X_{pred2} = F - B \times D$
$\{m_{12}, m_{21}, a_{12}\}$    $A_{obs} = A_{pred} = (F - B \times D)/C$, if $C \neq 0$
$\{m_{13}, m_{21}, a_{12}\}$    $C_{obs} = C_{pred} = (F - B \times D)/A$, if $A \neq 0$
$\{m_{11}, m_{21}, a_{13}\}$    $Y_{pred1} = F - A \times C = Y_{pred2} = B \times D$
$\{m_{11}, m_{22}, a_{13}\}$    $B_{obs} = B_{pred} = (F - A \times C)/D$, if $D \neq 0$
$\{m_{11}, m_{23}, a_{13}\}$    $D_{obs} = D_{pred} = (F - A \times C)/B$, if $B \neq 0$
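The seven MEMs can be cross-checked by brute force. The sketch below does not follow build-mem() from Appendix C. Instead it enumerates the 27 ways of choosing one interpretation per constraint and keeps those satisfying conditions 2 to 4 of the definition (condition 1 holds by construction). The head/tail encoding of each interpretation and all Python names are assumptions made for illustration.

```python
from itertools import product

OBS = {"A", "B", "C", "D", "F"}  # observed variables of the first MEC

# Interpretations per constraint: label -> (computed variable, input variables).
INTERPRETATIONS = {
    "m1": {"m11": ("X", {"A", "C"}), "m12": ("A", {"X", "C"}), "m13": ("C", {"X", "A"})},
    "m2": {"m21": ("Y", {"B", "D"}), "m22": ("B", {"Y", "D"}), "m23": ("D", {"Y", "B"})},
    "a1": {"a11": ("F", {"X", "Y"}), "a12": ("X", {"F", "Y"}), "a13": ("Y", {"F", "X"})},
}


def is_mem(arcs):
    heads = [head for head, _ in arcs]
    nodes = set(heads).union(*(tail for _, tail in arcs))
    # Condition 2: every leaf (node never computed) must be observed.
    if any(v not in OBS for v in nodes - set(heads)):
        return False
    # Condition 3: exactly one discrepancy node.
    discrepancies = [v for v in nodes
                     if heads.count(v) >= (1 if v in OBS else 2)]
    if len(discrepancies) != 1:
        return False
    target = discrepancies[0]
    successors = {v: {h for h, tail in arcs if v in tail} for v in nodes}

    # Condition 4: an acyclic directed path from every node to the target.
    def reaches(v, path=()):
        if v == target:
            return True
        if v in path:
            return False
        return any(reaches(w, path + (v,)) for w in successors[v])

    return all(reaches(v) for v in nodes)


def enumerate_mems():
    names = sorted(INTERPRETATIONS)
    mems = []
    for choice in product(*(sorted(INTERPRETATIONS[n]) for n in names)):
        arcs = [INTERPRETATIONS[n][label] for n, label in zip(names, choice)]
        if is_mem(arcs):
            mems.append(frozenset(choice))
    return mems


for mem in enumerate_mems():
    print(sorted(mem))
```

Under this encoding the enumeration keeps exactly the seven combinations listed in the table.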

It should be noticed that a MEC would provide no MEM if the
over-determined system cannot be solved using the available
interpretations and local propagation. In [31] the reader can find
additional information on how temporal information has been included in
this framework, and one example of a MEC which cannot provide any MEM.

Having summarized the possible conflict concept, the next section
studies the relationship between MECs and MEMs, which are computed
off-line, and real conflicts, which are computed on-line.

3  Conflicts  and  possible  conflicts 

If evaluated, a MEM could lead to a discrepancy, i.e., it could lead to
a conflict. However, the set of MEMs is computed off-line, without any
model evaluation, and conflicts would appear only when observations are
introduced and the evaluation model is computed. So, we have introduced
the following concept:

Possible  conflict:  The  set  of  constraints  in  a  Minimal  Evaluation 
Chain  giving  rise  to,  at  least,  one  Minimal  Evaluation  Model. 

For example, in the polybox system in Figure 1, there are three
possible conflicts:
$\{\{m_1, m_2, a_1\}, \{m_1, a_1, a_2, m_3\}, \{m_2, m_3, a_2\}\}$,
because every MEC has at least one MEM.
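To illustrate how a possible conflict becomes a real conflict once observations arrive, the sketch below evaluates one executable model per possible conflict. The observed values (A=3, B=2, C=2, D=3, E=3, F=10, G=12) are the ones commonly used for the polybox in the literature and are an assumption here, since the bracketed values of Figure 1 are not reproduced in the text. The Python names and the choice of MEM for each possible conflict are also illustrative.

```python
# Assumed observations for the polybox (classical textbook values).
OBSERVATIONS = {"A": 3, "B": 2, "C": 2, "D": 3, "E": 3, "F": 10, "G": 12}

# One executable model per possible conflict:
# constraints -> (observed variable that is estimated, estimator).
EXECUTABLE_MODELS = {
    ("m1", "m2", "a1"): ("F", lambda o: o["A"] * o["C"] + o["B"] * o["D"]),
    ("m2", "m3", "a2"): ("G", lambda o: o["B"] * o["D"] + o["C"] * o["E"]),
    ("m1", "a1", "a2", "m3"): (
        "G",
        # X = A*C, Y = F - X, Z = C*E, G = Y + Z
        lambda o: (o["F"] - o["A"] * o["C"]) + o["C"] * o["E"],
    ),
}


def confirmed_conflicts(observations):
    """Possible conflicts whose estimate disagrees with the observed value."""
    return [
        constraints
        for constraints, (variable, estimate) in EXECUTABLE_MODELS.items()
        if estimate(observations) != observations[variable]
    ]


print(confirmed_conflicts(OBSERVATIONS))
```

With these values the first and third possible conflicts are confirmed, because the estimates for F and G differ from the observations. The second one is not, because the estimate for G matches the observed value.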

In  such  a  case,  where  component  models  are  made  up  of  only 
one  relation,  the  set  of  possible  conflicts  is  equivalent  to  the  set  of 
minimal  conflicts  in  Reiter’s  terminology  computed  on-line  by  GDE, 
whatever  the  faults  and  whatever  the  set  of  available  observations. 

At  this  point  it  is  necessary  to  answer  the  following  question:  is 
the  set  of  possible  conflicts  equivalent  to  the  set  of  minimal  conflicts 
computed  on-line  by  GDE?  In  order  to  answer,  we  need  additional 
definitions: 

$P(S)$: the set of subsets of $S$;

$model : COMPS \rightarrow P(R_{sd})$: $model(C)$ identifies the family
of relations modelling the behavior of $C$;

$comp : R_{sd} \rightarrow COMPS$: $r_i \mapsto comp(r_i) = \{C \mid r_i \in model(C)\}$;
$comp(r_i)$ indicates the component containing relation $r_i$ in its
model.

4 Since the MEM will have the same set of variables as MEC, we just include
the set of interpretations.

Proposition 2 Let $co$ be a minimal conflict found by GDE, and $co$ is
related to a discrepancy in $v \in V_{sd}$: there is a minimal evaluation
chain, $H_{ec} = \{V_{ec}, R_{ec}\}$, such that:
$v \in V_{ec}$ and $co = \bigcup_{r_i \in R_{ec}} comp(r_i)$

Proof: GDE solves a minimal over-determined system to find a minimal
conflict related to $v$ [19]. Since build-every-mec() performs an
exhaustive search, it is able to find every minimal over-determined
system in $H_{sd}$. Hence, it will find that over-determined system too.

Hence, once GDE finds a minimal conflict, build-every-mec() will find a
MEC containing the same set of constraints that were used to find the
conflict. Those constraints belong to the same set of components.

Proposition 3 Let $co$ be a minimal conflict found by GDE, and $co$ is
related to a discrepancy in $v \in V_{sd}$: there is a minimal evaluation
model, $H_{em} = \{V_{em}, R_{em}\}$, that can obtain a discrepancy in $v$, and
$v \in V_{em}$ and $co = \bigcup_{r_{ik} \in R_{em}} comp(r_i)$

Proof: By proposition 2, there is a MEC related to $co$, such that:

$co = \bigcup_{r_i \in R_{ec}} comp(r_i)$

Moreover, build-every-mem() performs an exhaustive search too.
Therefore, it will find every MEM related to such MEC, i.e., every
possible way the MEC can be solved. Hence, it will find the
over-determined system used to obtain the minimal conflict. Also, each
$r_{ik} \in R_{em}$ is an interpretation for some $r_i \in R_{ec}$. Hence:

$co = \bigcup_{r_{ik} \in R_{em}} comp(r_i)$

At least one of the MEMs related to the MEC will find a discrepancy in
$v$, in the same way GDE does.

Unfortunately, the number of MEMs for each MEC is exponential in the
average number of interpretations for each hyper-arc in the MEC. For
practical reasons we select just one MEM related to a MEC. Based on that
MEM, we build an executable model which is used for fault detection. In
[31] the reader can find a detailed description of how possible
conflicts can be used to perform consistency-based diagnosis for both
static and dynamic systems.

Nevertheless, it is still possible to claim that the set of possible
conflicts is theoretically equivalent to the set of conflicts found
on-line by means of GDE. We show this in the next two propositions.

Proposition 4 If $H_{ec}$ is a MEC, $H_{em}$ is one of its MEMs and the
evaluation of the executable model associated to $H_{em}$ generates a
discrepancy in $v \in V_{em}$, then GDE will find a discrepancy in $v$.

Proof: There is a discrepancy in $v$ related to the evaluation of a MEM.
The MEM is a strictly over-determined system. Moreover, GDE finds any
discrepancy related to any minimal over-determined system. Hence, it
will find the discrepancy in $v$ too.


This proposition always holds. Unfortunately, the converse does not
hold universally, because we cannot guarantee for an arbitrary set of
non-linear constraints that every MEM for a MEC will provide the same
solution for a given set of observations [40]. This assumption should be
stated in the following way:

Equivalence assumption: Every MEM in a MEC provides the same set of
solutions for any given set of input observations.
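A toy example may help to see why the equivalence assumption can fail for non-linear constraints. The constraint y = x² below is hypothetical and is not taken from [40]; the function names are illustrative. Its forward interpretation is single-valued, while the inverse interpretation implemented with a square root returns only the non-negative solution.

```python
import math


def forward(x: float) -> float:
    """Interpretation 1 of the hypothetical constraint y = x**2: compute y."""
    return x * x


def inverse(y: float) -> float:
    """Interpretation 2: compute x from y, keeping only the non-negative root."""
    return math.sqrt(y)


x_obs, y_obs = -2.0, 4.0  # observations that do satisfy y = x**2

forward_discrepancy = forward(x_obs) != y_obs   # model estimating y
inverse_discrepancy = inverse(y_obs) != x_obs   # model estimating x

print(forward_discrepancy, inverse_discrepancy)
```

Both evaluation models come from the same constraint and the observations are consistent with it. Even so, only the inverse one reports a discrepancy, so the two models do not provide the same set of solutions.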

Now, it is possible to state the following proposition:

Proposition 5 If GDE finds a minimal conflict, $co$, related to a
discrepancy in $v$, and the equivalence assumption holds for a $H_{ec}$
containing $v$, then the possible conflict related to $H_{ec}$ will be
confirmed as a minimal conflict.

Proof: The proof is straightforward based on propositions 2 and 3.

4  Comparing  possible  conflicts,  conflicts,  and 
ARRs 

As previously mentioned, there is an on-going research interest from
the DX and FDI communities in comparing their approaches. Recently,
Cordier et al. [8] proposed a common framework to compare conflicts and
ARRs [34, 33]. Following that trend, we compare ARRs and possible
conflicts considering the way they are computed. Afterwards, we discuss
the results in [8] and draw some conclusions.

4.1  Possible  conflicts  and  ARRs 

The set of ARRs is obtained from the unique canonical decomposition of
the structural description of the system into under-determined,
just-determined, and over-determined sets of constraints. The canonical
decomposition is based on finding a complete matching, w.r.t. the
unknown variables, in the bipartite graph associated to the structural
description of the system. The combination of just-determined systems
together with redundant relations is the basis for an Analytical
Redundancy Relation [34].

Each complete matching can be considered as a causality assignment,
but it is necessary to obtain a causal matching for the over-determined
system from the set of causal matchings satisfying the invertibility
condition [33]. Each ARR can be solved and used for diagnosis purposes
once observed values are introduced.
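As a worked illustration on the polybox example of Figure 1 (not taken from [33, 34]; the component equations are the usual polybox ones and are assumed here), a complete matching w.r.t. the unknowns assigns $X$ to $m_1$, $Y$ to $m_2$ and $Z$ to $m_3$. The two unmatched constraints, $a_1$ and $a_2$, are the redundant relations. Substituting the matched equations into them gives two direct ARRs, and a third one can be deduced by eliminating $Y$ between $a_1$ and $a_2$:

```latex
\begin{align*}
m_1 &: X = A \cdot C, \qquad m_2 : Y = B \cdot D, \qquad m_3 : Z = C \cdot E \\
ARR_1 \ (\text{from } a_1) &: F - (A \cdot C + B \cdot D) = 0 \\
ARR_2 \ (\text{from } a_2) &: G - (B \cdot D + C \cdot E) = 0 \\
ARR_3 \ (\text{from } a_1, a_2) &: F - G - A \cdot C + C \cdot E = 0
\end{align*}
```

The constraint sets behind these three relations, $\{m_1, m_2, a_1\}$, $\{m_2, m_3, a_2\}$ and $\{m_1, a_1, a_2, m_3\}$, coincide with the three minimal evaluation chains found in Section 2.1.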

It should be noticed that all the steps, except the solving one, can
be done off-line. Hence, computing ARRs is a compilation technique in
FDI, and it seems obvious that strong similarities exist between the way
ARRs and possible conflicts are computed.

• Both methods search for over-determined sub-systems. Direct or
deduced ARRs can be used to estimate a value for an observed variable
in the system. Moreover, the algorithms used for computing MECs can be
used to obtain the whole set of over-determined sub-systems⁵. Hence, the
algorithms will find an evaluation chain with the same set of
constraints as the ARR.

• An ARR needs a causal matching, because not every causality
assignment can be done in the complete matching. In the same way, AND-OR
arcs are introduced to limit the ways a hyper-arc can be solved. It
seems obvious that one of the evaluation models for an evaluation chain
will be equivalent to the causal matching in the ARR.

5 It is straightforward to modify algorithm justify() to search for any
over-determined system.




• The set of evaluation models for an evaluation chain is built based
on the local propagation criterion, i.e., the evaluation model does not
contain any cycle. This condition has been imposed in the ARR approach
too. For this reason, the ARR is obtained once graph reduction, by means
of loop elimination, has been done in the causal graph [33]. This step
is equivalent to loop elimination in the possible conflict approach
[29].

However,  there  are  some  differences: 

• Staroswiecki et al. [33] assume that in an over-determined system the
set of unknowns can be computed in different ways, using constraints and
known values, and “deduced redundancy relations are obtained writing
that all these results have to be the same”. This assumption is the same
as the equivalence assumption in the previous section.

As mentioned above, that assumption is never made in GDE while computing
minimal conflicts, because the assumption does not hold universally for
physical systems made up of general non-linear constraints [40].
Therefore, based on propositions 4 and 5, it cannot be claimed that
model-based diagnosis relying upon ARRs and consistency-based diagnosis
using conflicts will always provide the same set of results. Results
obtained using ARRs would be the same as those obtained using just one
MEM for each MEC. These results can be sub-optimal, w.r.t. the number of
detected conflicts, unless the equivalence assumption holds.

•  Moreover,  build-every-mec()  provides  the  whole  set  of  minimal 
evaluation  chains,  because  we  look  for  minimal  conflicts.  This  is 
not  guaranteed  in  the  original  ARR  approach,  which  should  be 
revised  to  find  just  minimal  ARRs. 


4.2  Discussion 


Cordier et al. [8] defined the support for an ARR as “the set of
components involved in the ARR”. This term was also called “potential
R-conflict”, because of their Proposition 4.1:

“Let OBS be a set of observations for a system modeled by SM (resp.
SD). There is an identity between the set of minimal R-conflicts for
OBS and the set of minimal potential R-conflicts associated to the ARRs
which are not satisfied by OBS.”

As stated in the previous section, we think it is necessary to make
three explicit assumptions to guarantee that such a conclusion holds
universally:

• the equivalence assumption holds,

• the set of ARRs is built based on minimality criteria, and

• we have a component-oriented behavior description of the system, but
minimality is considered w.r.t. sets of constraints.

Regarding the first two conditions, it seems obvious that proposition 5
in Section 3 is equivalent to proposition 4.1 in [8] when both
assumptions hold. The third assumption must be taken into account when
behavioral models are made up of more than one constraint. Minimality
w.r.t. sets of constraints is needed because not every possible conflict
is equivalent to a minimal conflict in Reiter’s framework. We will
illustrate this using the system in Figure 2. The system is made up of
common components in process industry such as tanks, pumps, valves, and
so on.

Figure 2. Scheme of the system to be diagnosed. Measured variables are
the flows FT01, FT02, FT03, and FT04, the level of tank LT05, and the
value of the control action on valve V2, $u_2$, at the output of tank TR2.

Its related hyper-graph can be described as:

$H_{sd} = \{V_{sd}, R_{sd}\}$;

$V_{sd} = \{OBS \cup NOBS\}$;
OBS  =  {ft,f;,f5,f;1,hZ.Ray,

NOBS  =  {/9,  /10,  /12,  /«,  mrKl ,  mTfll ,  ftrit! ,  ,  mr2 ,

hr2 1  ttiTR2  5  ttiT r2  5  it 2  >  ^Pp2  >  ABp? j  >  >  R2 Tr 1 >  Rt  t2  ■

C2 7  2  ?  Cl  TR2  ■  ^2TR2  '  Ucont2  }

$R_{sd} = \{tr1_1, tr1_2, tr1_3, tr1_4, t2_1, t2_2, t2_3, t2_4, t2_5, p2_1, p2_2, p2_3, p3_1, p3_2, p3_3, v2_1, v2_2, tr2_1, tr2_2, tr2_3, tr2_4, tr2_5, tr2_6\}$

The meaning of each equation above can be found in Appendix D. We have
used common equations for computing mass balances, overflows, and so
on. Analyzing the system we have found three possible conflicts. The
reader should notice that $PC_3$ is minimal w.r.t. constraints, but not
minimal w.r.t. components.

$PC_i$ and the components involved:

$PC_1 = \{tr1_1, tr1_2, tr1_3, tr1_4, t2_1, t2_2, t2_3, t2_4, t2_5, p2_1, p2_2, p2_3\}$
Components: $\{TR1, T2, P2\}$

$PC_2 = \{tr2_1, tr2_3, p3_1, p3_2, p3_3, v2_1, v2_2\}$
Components: $\{TR2, P3, V2\}$

$PC_3 = \{tr1_1, tr1_2, tr1_3, tr1_4, t2_1, t2_2, t2_3, t2_3, p2_3, tr2_1, tr2_5, tr2_6, p3_3, v2_1\}$
Components: $\{TR1, T2, P2, TR2, P3, V2\}$
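The difference between minimality w.r.t. constraints and w.r.t. components can be made concrete with the three possible conflicts of the hydraulic system. In the sketch below the constraint names are transcribed from the table of possible conflicts as printed. The mapping from a constraint-name prefix to its component (for instance `tr1` to TR1) is an assumption inferred from the naming, and all Python names are illustrative. Only the prefixes matter for the component sets.

```python
# Assumed mapping from constraint-name prefix to component.
PREFIX_TO_COMPONENT = {
    "tr1": "TR1", "t2": "T2", "p2": "P2",
    "tr2": "TR2", "p3": "P3", "v2": "V2",
}

# Constraint sets as transcribed from the table of possible conflicts.
POSSIBLE_CONFLICTS = {
    "PC1": {"tr1_1", "tr1_2", "tr1_3", "tr1_4", "t2_1", "t2_2", "t2_3",
            "t2_4", "t2_5", "p2_1", "p2_2", "p2_3"},
    "PC2": {"tr2_1", "tr2_3", "p3_1", "p3_2", "p3_3", "v2_1", "v2_2"},
    "PC3": {"tr1_1", "tr1_2", "tr1_3", "tr1_4", "t2_1", "t2_2", "t2_3",
            "p2_3", "tr2_1", "tr2_5", "tr2_6", "p3_3", "v2_1"},
}


def comp(constraint: str) -> str:
    """Component whose model contains the constraint, e.g. 'tr1_3' -> 'TR1'."""
    return PREFIX_TO_COMPONENT[constraint.split("_")[0]]


def constraints(pc: str) -> set:
    return POSSIBLE_CONFLICTS[pc]


def components(pc: str) -> set:
    """Union of comp(r) over the constraints of a possible conflict."""
    return {comp(r) for r in POSSIBLE_CONFLICTS[pc]}


def is_minimal(pc: str, key) -> bool:
    """True if no other possible conflict is a strict subset of pc under `key`."""
    mine = key(pc)
    return not any(key(other) < mine
                   for other in POSSIBLE_CONFLICTS if other != pc)


print(is_minimal("PC3", constraints), is_minimal("PC3", components))
```

Under these assumptions PC3 is minimal as a set of constraints but not as a set of components, since its component set strictly contains those of PC1 and PC2.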


5  Conclusions 

In this paper we have shown that compilation of dependencies by means
of the possible conflict approach is theoretically equivalent to on-line
dependency recording in GDE. However, it is not possible to claim that,
in practice, consistency-based diagnosis using possible conflicts
provides the same results as GDE does, unless the equivalence assumption
holds.

We have found that the model of an ARR is equivalent to some evaluation
model for an evaluation chain. Since we select just one MEM for each MEC
for practical reasons, we conclude that both approaches can obtain
equivalent results (assuming ARRs are computed based on minimality
criteria).

Finally, we have concluded that Proposition 4.1 in [8] needs to be
revised taking into account the results in propositions 4 and 5, and
considering minimality criteria w.r.t. constraints.
A  Graph  and  hyper-graph  notation

$H = [V, E]$: hyper-graph $H$, made up of $V$, the nodes, and $E$, a family of subsets of $V$
$\Gamma_i$: successors of node $i$
$\Gamma_i^{-1}$: predecessors of node $i$
$d_H(i)$: degree of node $i$ in $H$
$d_H^+(i)$, $d_H^-(i)$: output and input demi-degree of node $i$ in $H$

Bipartite graph: $G = [V, E]$ is a bipartite graph if there are two
disjoint parts in $V = S \cup T$, and edges in $E$ are always directed
from $S$ to $T$.

Matching: A matching $M$ in $G = [V, E]$ is a subset of $E$ such that
no two arcs in $M$ share a common vertex incident to them.

B  Algorithms  for  computing  the  set  of  minimal 
evaluation  chains 

Algorithm build-every-mec (SMEC) is
  SMEC: set of MEC; { Each MEC is a set of constraints }
  available, to-be-justified, justified, chain: set of constraints;
  R, R1: constraint;
Begin
  available := Constraints-in(H_sd);
  while available ≠ ∅ do
    R := Select-constraint(available);
    chain := ∅;
    available := available \ {R};
    build-mec (SMEC, chain, R, available);
  end while
End

Algorithm build-mec (SMEC, chain, R, available) is
Begin
  Insert R in chain;
  to-be-justified := R.nobs;
  justified := ∅;
  Justify (SMEC, chain, to-be-justified, justified, available);
End

Algorithm Justify (SMEC, chain, to-be-justified, justified, available) is
  v: unknown variable;
  related: set of constraints;
Begin
  if to-be-justified = ∅ then
    if there is no subset of chain in SMEC then
      Erase chain supersets from SMEC;
      Insert chain in SMEC;
    end if { Only minimal chains are included in SMEC. }
  else
    v := select-variable (to-be-justified);
    related := {R | R ∈ available and v ∈ R.nobs};
    while related ≠ ∅ do
      R1 := select-r (related);
      related := related \ {R1};
      chain2 := chain ∪ {R1};
      justified2 := justified ∪ {v};
      to-be-justified2 := (to-be-justified \ {v}) ∪ (R1.nobs \ justified2);
      available2 := available \ {R1};
      Justify (SMEC, chain2, to-be-justified2, justified2, available2);
    end while
  end if
End


C  Algorithms  for  computing  the  set  of  minimal 
evaluation  models 

Algorithm build-every-mem (SMEC, SMEM) is
Begin
  for chain = each MEC in SMEC do
    for R = each constraint in chain do
      for I = each interpretation for R do
        model := {I};
        to-be-justified := I.nobs;
        justified := ∅;
        chain := chain \ {R};
        build-mem (model, chain, to-be-justified, justified, SMEM);
      end for
    end for
  end for
End

Algorithm build-mem (model, available, to-be-justified, justified, SMEM) is
Begin
  if to-be-justified = ∅ and available = ∅ and ∃! discrepancy node in model then
    Insert model in SMEM;
  else
    for S = each constraint in available do
      if S.nobs ∩ to-be-justified ≠ ∅ then
        for I2 = each interpretation for S do
          if head(I2) ∩ to-be-justified ≠ ∅ then
            Insert {I2} in model;
            available := available \ {S};
            to-be-justified := (to-be-justified \ head(I2)) ∪ tail(I2).nobs;
            Insert head(I2) in justified;
            build-mem (model, available, to-be-justified, justified, SMEM);
          end if
        end for
      end if
    end for
  end if
End

D  Constraints  used  to  model  the  hydraulic  system 


Constraints                  Represent

$tr1_1$, $t2_1$, $tr2_4$     Mass balance in T: m'T = font
$tr1_2$, $t2_2$              Overflow in T: $f_{out} = \sqrt{k \cdot (h_T - h_{ext})}$
$tr1_3$, $t2_3$, $tr2_5$     Mass: $m_T(t) = \int m'_T(t - 1)\,dt + m_T(t - 1)$
$tr1_4$, $t2_5$, $tr2_6$     Height in T: h t = &i •
$t2_4$, $tr2_2$              Pressure at bottom: Pt± = &2 • hT + Patm
$p2_1$, $p3_2$               Pump load curve in P: $\Delta P_P = tablePQ(f_{out})$
$p2_2$, $p3_1$               Outflow in T: fout = y «3 • (P^+^-Pp-Pl)
$p2_3$, $p3_3$, $v2_1$       Flow out of tank: $f_{in} = f_{out}$
$tr2_1$                      Control: $u = PID(h_T)$
$v2_2$                       Flow through a valve: fout = J kts ■ wo^
Abstract. Technical systems are in general not guaranteed to work
correctly. They are more or less reliable. One main problem for
technical systems is the computation of the reliability of a system. A
second main problem is the problem of diagnostic. In fact, these
problems are in some sense dual to each other.

In this paper, we will use the concept of probabilistic argumentation
systems (PAS) for modeling the system description as well as
observations and specifications of behaviour in one common framework. We
show that PAS are a framework in which both main problems can be
formulated easily and all concepts for these two problems can be clearly
defined. Using PAS, reliability and diagnostic can be considered as dual
problems. PAS allows one common strategy to be considered for computing
answers to the questions in the different situations.

1  Introduction  and  Overview 

One main problem for technical systems is the computation of the
reliability of a system. This is studied in reliability theory (see for
example [7, 8]). The reliability depends on various factors like the
quality and the age of components, the complexity of the system, etc.
The reliability of a system conveys some information about the behavior
of the system in the future, based on information about the components,
for example probabilistic information about the reliability over time.

A second main problem for technical systems is the problem of
diagnostic. Here, the problem is to explain the behavior of the system,
usually based on measurements and observations of some parts of the
system, together with the system description in some framework. The
actual observations and the description of the system are the only
ingredients for the computation of the diagnoses. Additionally, if
probabilistic knowledge is available about the different operating modes
of the components, then the likelihood of the system states can be
defined and prior as well as posterior probabilities can be computed for
the set of possible system states.

* Research supported by grant No. 2000-061454.00 of the Swiss National
Foundation for Research.


Figure  1.  Reliability  versus  Diagnostic. 


The  two  main  problems  depend  both  on  a  formalization  of 
the  system  in  some  framework  together  with  either  observa¬ 
tions,  measurements,  or  requirements  (Fig.  1).  Here,  we  will 
use  the  concept  of  probabilistic  argumentation  systems  PAS 
for  modeling  the  system  description  as  well  as  observation  and 
specifications  of  behaviour  in  one  common  framework.  The 
goal  of  a  PAS  is  to  derive  arguments  in  favor  and  against  the 
hypothesis  of  interest.  An  argument  is  a  defeasible  proof  built 
on  uncertain  assumptions,  i.e.  a  chain  of  deductions  based 
on  assumptions  that  makes  the  hypothesis  true.  If  probabilis¬ 
tic  information  is  available,  a  quantitative  judgement  of  the 
situation  is  obtained  by  considering  the  probabilities  that  the 
arguments  are  valid.  The  resulting  degrees  of  support  and  pos¬ 
sibility  correspond  to  belief  and  plausibility,  respectively,  in 
the Dempster-Shafer theory of evidence [24, 20]. In fact, PAS combines the strengths of logic and probability in one framework. In this paper we show that probabilistic argumentation systems are a framework in which both main problems, i.e. reliability and diagnostic, can be formulated easily and in which all concepts can be defined clearly. In particular, the framework allows one common strategy to be used for computing answers to the questions in the different situations. Some work in this direction, but without using PAS, has been done by Provan [22].

The main information for both problems is the description of the system in some formalism; we will focus here on a formalization using logic. In the case of reliability, we may have
a  specification  which  describes  the  goals  which  have  to  be  ful¬ 
filled  by  the  system.  This  information  will  be  used  to  compute 
the  structure  function  from  the  system  description.  Different 
specifications  may  lead  to  different  structure  functions.  Even 
in  the  absence  of  an  explicit  specification  of  a  reliability  re¬ 
quirement,  we  may  deduce  a  structure  function  by  assuming 
that  the  system  should  be  functioning  at  least  if  all  compo¬ 
nents  are  working. 

On the other hand, in the case of diagnostic, some observations of the system may indicate that the system is not working as it is supposed to. This information, together with the system description, then allows the diagnoses of the system to be computed, i.e. minimal sets of components whose malfunctioning “explains” the wrong behaviour of the whole system.

2  Reliability 

2.1  Combinatorial  Reliability 


Figure  2.  A  simple  device 


In binary combinatorial reliability, a system is assumed to be composed of a number of different components. Each component is either intact or down, and so is the whole system itself, depending on the states of its components. In order to formulate this, binary variables $x_i$ are associated to components $i = 1, 2, \ldots, n$ of the system, where $x_i = \top$ if the component number $i$ works and $x_i = \bot$ otherwise. Let $\mathbf{x}$ be the vector $(x_1, x_2, \ldots, x_n)$ of the component states. This state-vector has $2^n$ possible values. These values can be decomposed into two disjoint subsets, the set $S_\top$ of working states, for which the system as a whole is assumed to be functioning, and the set $S_\bot$ of down-states, for which the system is supposed to not work properly. The corresponding system state is denoted by $x$. Its dependence on the state-vector $\mathbf{x}$ is described by a Boolean function $\phi$, defined as

$$x = \phi(\mathbf{x}) = \begin{cases} \top & \text{if } \mathbf{x} \in S_\top, \\ \bot & \text{if } \mathbf{x} \in S_\bot. \end{cases}$$

The Boolean function $\phi$ is called the structure function of the system. In combinatorial reliability it is assumed to be given and it forms the base for reliability analysis.

The structure function $\phi$ is usually assumed to be monotone. That is, if $\mathbf{x}_1 \le \mathbf{x}_2$, then $\phi(\mathbf{x}_1) \le \phi(\mathbf{x}_2)$. For a monotone structure function, a subset $P \subseteq \{1, 2, \ldots, n\}$ of components is called a path if $\phi(\mathbf{x}) = \top$ for all state-vectors $\mathbf{x}$ for which the components of the set $P$ are working, $x_i = \top$ for all $i \in P$. That is, the elements of a path are sufficient to guarantee the functioning of the system, regardless of the state of the components outside the path. We assume that the set $\{1, 2, \ldots, n\}$ of all components is a path (otherwise the system would never be functioning). A path $P$ is called minimal if no proper subset of $P$ is still a path. Since the paths are upwards closed, it is sufficient to know all minimal paths. Let $\mathcal{P}$ denote the set of minimal paths. This set determines the structure function,

$$\phi(\mathbf{x}) = \bigvee_{P \in \mathcal{P}} \bigwedge_{i \in P} x_i. \qquad (2)$$

This logical formula expresses the fact that the system is working if all components of at least one minimal path are working. Dually, the notion of a cut is defined and $\mathcal{C}$ denotes the set of all minimal cuts.

If for every component $i = 1, 2, \ldots, n$ its respective probability $p_i$ of functioning correctly is defined, then the probability that the system is functioning can be computed (assuming the components to be stochastically independent). In fact, $\phi(\mathbf{x})$ is a random variable, and the probability $p$ that the system is functioning is

$$p = E(\phi(\mathbf{x})) = h(\mathbf{p}). \qquad (3)$$

Here, $\mathbf{p}$ denotes the vector $(p_1, p_2, \ldots, p_n)$ of probabilities. $h(\mathbf{p})$ is called the reliability function and its computation is a nontrivial task [1, 16, 5].
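As an illustration of the path representation (2) and the reliability function (3), the following minimal sketch evaluates $h(\mathbf{p})$ by summing over all $2^n$ state-vectors. It is not one of the efficient algorithms of [1, 16, 5]; the three-component system, its minimal paths and the component probabilities are hypothetical values chosen for illustration only.

```python
from itertools import product

def structure_function(min_paths, state):
    """phi(x): True if all components of at least one minimal path work."""
    return any(all(state[i] for i in path) for path in min_paths)

def reliability(min_paths, probs):
    """h(p) = E[phi(x)], summed over all state-vectors (independent components)."""
    names = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        state = dict(zip(names, values))
        weight = 1.0
        for name in names:
            weight *= probs[name] if state[name] else 1.0 - probs[name]
        if structure_function(min_paths, state):
            total += weight
    return total

# Hypothetical system: minimal paths {A, C} and {B, C}, all p_i = 0.9.
paths = [{"A", "C"}, {"B", "C"}]
h = reliability(paths, {"A": 0.9, "B": 0.9, "C": 0.9})
print(round(h, 6))
```

The enumeration is exponential in the number of components, which is why it only serves as a reference for small systems.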


2.2  Model-Based  Reliability 

The  structure  function  describes  the  conditions  under  which 
a  system  is  functioning,  depending  on  the  states  of  its  com¬ 
ponents.  It  is  already  a  compilation  of  knowledge  about  the 
system  and  its  structure.  In  this  section  we  shall  illustrate  an¬ 
other  approach,  where  a  more  physical  description  of  a  system 
is  given.  Additionally,  a  specification  of  the  desired  behavior 
of  the  system  is  given.  These  two  elements  will  then  allow  the 
deduction  of  a  structure  function  and  its  associated  reliability 
function.  The  discussion  in  this  section  will  be  informal. 

Example  1 :  Detector  of  Power  Failure 

(Example  adapted  from  [22]) 

Consider a simple device which watches a Boolean value $in$ and reports an output $out$ equal to $\top$ if the value vanishes (becomes $\bot$). A simple version of such a device is depicted in Figure 2. The functionality of this device can be described with propositional logic. Let $in$ and $out$ be the variables which denote the state of the input and output, respectively. Both variables are binary, i.e. they represent the Boolean values true or false. Further, there are two internal variables $x_1$ and $x_2$, also binary. For every component A, B or C, there is a respective binary variable $ok_A$, $ok_B$, and $ok_C$ which describes the working mode of the component.

Consider the inverter A: if it works correctly ($ok_A$ is true), then its output is the negation of its input, i.e. its output $x_1$ is true if and only if $in$ is false. We express this by the formula $in \leftrightarrow \neg x_1$. So the entire information is modeled as the logical implication $ok_A \to (in \leftrightarrow \neg x_1)$. Note that so far nothing is said about the behavior of the component if it is down ($ok_A$ is false). There are several possibilities. One is that in this case the output of the component is always false, i.e. $\neg ok_A \to \neg x_1$.

For the component B, the same specification can be applied. For the or-gate C, if it works correctly, then the output is true if at least one of its inputs is. So the whole information about the device is modeled by a set of six implications:

$$\Sigma = \left\{ \begin{array}{l} ok_A \to (in \leftrightarrow \neg x_1), \\ \neg ok_A \to \neg x_1, \\ ok_B \to (in \leftrightarrow \neg x_2), \\ \neg ok_B \to \neg x_2, \\ ok_C \to (out \leftrightarrow x_1 \vee x_2), \\ \neg ok_C \to \neg out \end{array} \right\} \qquad (4)$$

This is the system description. We now add to this physical description a specification of what we expect from the system. We expect that negative (false) input is detected, i.e. the output is true. This could be expressed by $\neg in \to out$. However, this is a weak requirement. It does not exclude that $out$ becomes true even if $in$ is true. More stringent would be the specification $\neg in \leftrightarrow out$. This asks that there is an alarm ($out$) if, and only if, $in$ is false.

We may now ask under which states, described by the variables $ok_A$, $ok_B$, and $ok_C$, each one of these specifications is fulfilled. This defines the structure function of the system associated with the corresponding specification of desired system behavior. We shall see in the next section that it is a well-defined problem of propositional logic to deduce these structure functions from the system description and the specifications of desired behavior. ©

This  example  shows  how  the  physical  behavior  of  systems 
and  the  required  behavior  can  be  described  in  the  language 
of  propositional  logic.  We  shall  examine  this  structure  in  the 
following  section  in  a  general  context. 

3  Probabilistic  Argumentation  Systems 

Probabilistic  argumentation  systems  have  been  developed  as 
general  formalisms  for  expressing  uncertain  and  partial  know¬ 
ledge  and  information  in  artificial  intelligence.  They  combine 
in  an  original  way  logic  and  probability.  Logic  is  used  to  derive 
arguments  and  probability  serves  to  compute  the  reliability  or 
likelihood  of  these  arguments.  These  systems  can  be  used  for 
model-based  diagnostics  as  has  been  demonstrated  in  [2,  19]. 
Here  we  shall  show  how  they  relate  to  reliability  theory. 

In  this  section  we  give  a  short  introduction  into  proposi¬ 
tional  probabilistic  argumentation  systems.  For  a  more  de¬ 
tailed  presentation  of  the  subject  we  refer  to  [15].  We  remark 
also  that  such  systems  have  been  implemented  in  a  system 
called  ABEL  which  is  available  on  the  internet  (cf.  [14]  for 
further  information). 

3.1  Propositional  Logic 

Propositional logic deals with declarative statements, called propositions, that can be either true or false. Let $P = \{p_1, \ldots, p_n\}$ be a finite set of propositions. The symbols $p_i \in P$, together with $\top$ (tautology) and $\bot$ (falsity), are called atoms. Compound formulas are built by the following syntactic rules:

• atoms are formulas;

• if $\gamma$ is a formula, then $\neg\gamma$ is a formula;

• if $\gamma$ and $\delta$ are formulas, then $(\gamma \wedge \delta)$, $(\gamma \vee \delta)$, $(\gamma \to \delta)$, and $(\gamma \leftrightarrow \delta)$ are formulas.

By assigning priority in decreasing order to $\neg$, $\wedge$, $\vee$, $\to$, some parentheses can be eliminated. The set $\mathcal{L}_P$ of all formulas generated by the above recursive rules is called the propositional language over $P$.


A literal is either an atom $p_i$ or the negation of an atom $\neg p_i$. A term is either $\top$ or a conjunction of literals where every atom occurs at most once (but none of $\top$ and $\bot$), and a clause is either $\bot$ or a disjunction of literals where every atom occurs at most once (but none of $\top$ and $\bot$). $\mathcal{C}_P \subseteq \mathcal{L}_P$ denotes the set of all terms, and $\mathcal{D}_P$ the set of all clauses.

$N_P = \{0, 1\}^n$ denotes the set of all $2^n$ different interpretations for $P$. If $\gamma \in \mathcal{L}_P$ evaluates to 1 under $\mathbf{x} \in N_P$, then $\mathbf{x}$ is called a model of $\gamma$. The set of all models of $\gamma$ is denoted by $N_P(\gamma) \subseteq N_P$.

A propositional sentence $\gamma$ entails another sentence $\delta$ (denoted by $\gamma \models \delta$) if and only if $N_P(\gamma) \subseteq N_P(\delta)$. Sometimes it is convenient to write $\mathbf{x} \models \gamma$ instead of $\mathbf{x} \in N_P(\gamma)$. Also we write $\gamma \models \bot$ if $\gamma$ is not satisfiable. Furthermore, two sentences $\gamma$ and $\delta$ are logically equivalent (denoted by $\gamma \equiv \delta$) if and only if $N_P(\gamma) = N_P(\delta)$.

3.2  Basic  Concepts  of  Argumentation 
Systems 

Consider two finite sets $P = \{p_1, \ldots, p_m\}$ and $A = \{a_1, \ldots, a_n\}$ of propositional variables with $A \cap P = \emptyset$; the elements of $P$ are called propositions, the elements of $A$ assumptions. We consider a fixed set of formulas $\Sigma \subseteq \mathcal{L}_{A \cup P}$ called the knowledge base, which models the information available; sets of formulas are interpreted conjunctively, i.e. $\Sigma = \bigwedge \{\xi : \xi \in \Sigma\}$. We assume that this knowledge base is satisfiable. A triple $(\Sigma, A, P)$ is called a propositional argumentation system PAS.

The elements of $N_A$ are called scenarios (or system states). A scenario represents a specification of all values of the assumptions in $A$. Define now, for $h \in \mathcal{L}_{A \cup P}$:

Inconsistent Scenarios: $CS_A(\Sigma) := \{\mathbf{s} \in N_A : \mathbf{s}, \Sigma \models \bot\}$,

Quasi-Supporting Scenarios of $h$: $QS_A(h, \Sigma) := \{\mathbf{s} \in N_A : \mathbf{s}, \Sigma \models h\}$,

Supporting Scenarios of $h$: $SP_A(h, \Sigma) := QS_A(h, \Sigma) - CS_A(\Sigma)$,

Possible Scenarios for $h$: $PL_A(h, \Sigma) := SP_A^c(\neg h, \Sigma)$.

Inconsistent scenarios are in contradiction with the knowledge base and therefore to be considered as excluded by the knowledge. Supporting scenarios for a formula $h$ are scenarios which, together with the knowledge base, imply $h$ and are consistent with the knowledge. So, under supporting scenarios, the hypothesis $h$ is true. Possible scenarios for $h$ are scenarios which do not imply $\neg h$ and thereby do not refute $h$. Quasi-supporting scenarios for $h$ are the union of supporting scenarios and inconsistent scenarios.

Scenarios  are  the  basic  concepts  of  assumption-based  rea¬ 
soning.  However,  sets  of  inconsistent,  quasi-supporting,  sup¬ 
porting  and  possible  scenarios  may  become  very  large.  There¬ 
fore,  more  economical,  logical  representations  of  these  sets  are 
needed.  For  this  purpose,  the  following  concepts  are  defined: 

Set of Supporting Arguments for $h$:

$$SP(h, \Sigma) = \{\alpha \in \mathcal{C}_A : N_A(\alpha) \subseteq SP_A(h, \Sigma)\}.$$

The sets of quasi-supporting and of possible arguments are defined analogously. Note that supporting arguments are similar to paths for structure functions in reliability theory. This similarity will be exploited later. These sets are all upward closed. Hence the sets of arguments are already determined by their minimal elements. We denote by $\mu QS(h, \Sigma)$, $\mu SP(h, \Sigma)$ and $\mu PL(h, \Sigma)$ the sets of minimal quasi-supporting, supporting and possible arguments. Further,

Conflict: $conf(\Sigma) := \bigvee_{\alpha \in \mu QS(\bot, \Sigma)} \alpha$,

Support of $h$: $sp(h, \Sigma) := \bigvee_{\alpha \in \mu SP(h, \Sigma)} \alpha$.

Quasi-support $qs(h, \Sigma)$ and possibility $pl(h, \Sigma)$ are defined analogously. Clearly, any formula which is logically equivalent to the logical representations above can be used as a representation.

Example  2:  ( Cont.  of  Example  1 ) 

The information of Example 1 is modeled in an argumentation system as follows: $A = \{ok_A, ok_B, ok_C\}$, $P = \{in, x_1, x_2, out\}$ and $\Sigma$ as in (4). There are no inconsistent scenarios and for $h = \neg in \to out$ we have $QS_A(h, \Sigma) = \{(0, 1, 1), (1, 0, 1), (1, 1, 1)\}$ and $PL_A(h, \Sigma) = N_A$. As $CS_A(\Sigma) = \emptyset$, we have $QS_A = SP_A$ in this situation and there are some arguments in favor of the hypothesis, but none against it. Hence, $qs(h, \Sigma) = (ok_A \wedge ok_C) \vee (ok_B \wedge ok_C)$ and $pl(h, \Sigma) = \top$. ©
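The scenario sets of this example can be checked by brute force. The sketch below, which is only an illustration and not the ABEL implementation, enumerates all scenarios over the assumptions and all interpretations of the propositions; the function names and the encoding of the knowledge base as a Python predicate are assumptions made here for the example.

```python
from itertools import product

ASSUMPTIONS = ("okA", "okB", "okC")
PROPOSITIONS = ("in", "x1", "x2", "out")

def sigma(v):
    """Knowledge base of Example 1: True if v satisfies all six implications."""
    return ((not v["okA"] or (v["in"] == (not v["x1"])))
            and (v["okA"] or not v["x1"])
            and (not v["okB"] or (v["in"] == (not v["x2"])))
            and (v["okB"] or not v["x2"])
            and (not v["okC"] or (v["out"] == (v["x1"] or v["x2"])))
            and (v["okC"] or not v["out"]))

def h(v):
    """Hypothesis: not in -> out."""
    return v["in"] or v["out"]

def quasi_supporting(kb, hyp):
    """Scenarios s such that every model of kb extending s satisfies hyp."""
    result = []
    for s in product([0, 1], repeat=len(ASSUMPTIONS)):
        ok = True
        for x in product([0, 1], repeat=len(PROPOSITIONS)):
            v = dict(zip(ASSUMPTIONS + PROPOSITIONS, map(bool, s + x)))
            if kb(v) and not hyp(v):
                ok = False
                break
        if ok:
            result.append(s)
    return result

qs = quasi_supporting(sigma, h)
# Inconsistent scenarios are the quasi-supporting scenarios of falsity.
inconsistent = quasi_supporting(sigma, lambda v: False)
print(qs, inconsistent)
```

Scenarios are ordered as $(ok_A, ok_B, ok_C)$, so the result can be compared directly with the sets given in the example.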

3.3  Probabilistic  Information 

On top of the structure of a propositional argumentation system, we may easily add a probability structure. Assume that a probability $p(a_i) = p_i$ is given for every assumption $a_i \in A$. Assuming stochastic independence between assumptions, a scenario $\mathbf{s} = (s_1, \ldots, s_n)$ gets the probability

$$p(\mathbf{s}) = \prod_{i=1}^{n} p_i^{s_i} (1 - p_i)^{1 - s_i}.$$

This induces a probability measure $p$ on $\mathcal{L}_A$,

$$p(f) = \sum_{\mathbf{s} \in N_A(f)} p(\mathbf{s})$$

for $f \in \mathcal{L}_A$. A quadruple $(\Sigma, A, P, \Pi)$ with $\Pi = (p_1, \ldots, p_n)$ is then called a probabilistic (propositional) argumentation system PAS.

The  problem  of  computing  the  probability  p(f)  is  similar  to 
the  problem  of  computing  the  reliability  of  a  structure  func¬ 
tion,  except,  that  monotonicity  cannot  be  assumed  in  general; 
for  algorithms  for  efficiently  computing  the  probability  p(f) 
see  [20,  9,  13]. 

Once we have such a probability structure on top of a propositional argumentation system, we can exploit it to compute likelihoods (or in fact, reliabilities) of supporting and possible arguments for hypotheses $h$. First, we note that $\Sigma$ imposes that we eliminate the inconsistent scenarios and condition the probability on the consistent ones. In other words, $\Sigma$ is an event that restricts the possible scenarios to the set $N_A - CS_A(\Sigma)$, hence their probability has to be conditioned on the event $\Sigma$. This conditional probability is defined by

$$p'(\mathbf{s}) = \frac{p(\mathbf{s})}{1 - p(qs(\bot, \Sigma))}$$

for consistent scenarios $\mathbf{s}$. $p(qs(h, \Sigma)) = dqs(h)$ is the so-called degree of quasi-support for $h$. Now, the degree of support $dsp$ for hypotheses $h$ is defined by

$$dsp(h) = p'(sp(h, \Sigma)) = \frac{dqs(h, \Sigma) - dqs(\bot, \Sigma)}{1 - dqs(\bot, \Sigma)}.$$

This result explains the importance of quasi-support. It is sufficient to compute degrees of quasi-support. Further, we obtain the degree of plausibility of $h$,

$$dpl(h) = p'(pl(h, \Sigma)) = \frac{1 - dqs(\neg h, \Sigma)}{1 - dqs(\bot, \Sigma)} = 1 - dsp(\neg h).$$

Degree of quasi-support $dqs(h)$ and of support $dsp(h)$ correspond in fact to unnormalized and normalized belief in the Dempster-Shafer theory of evidence [24, 20, 15].
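The degrees defined above can be evaluated directly from sets of scenarios. The following sketch assumes that the quasi-supporting scenario sets are already known (here taken from Example 2) and that all three assumptions have the hypothetical probability 0.9; it enumerates scenarios explicitly and is not one of the efficient algorithms of [20, 9, 13].

```python
def scenario_probability(s, probs):
    """p(s) = prod_i p_i^{s_i} (1 - p_i)^{1 - s_i} for independent assumptions."""
    result = 1.0
    for s_i, p_i in zip(s, probs):
        result *= p_i if s_i else 1.0 - p_i
    return result

def dqs(scenarios, probs):
    """Degree of quasi-support: probability of a set of scenarios."""
    return sum(scenario_probability(s, probs) for s in scenarios)

def dsp(qs_h, qs_bot, probs):
    """Degree of support: quasi-support normalized by the consistent scenarios."""
    return (dqs(qs_h, probs) - dqs(qs_bot, probs)) / (1.0 - dqs(qs_bot, probs))

def dpl(qs_not_h, qs_bot, probs):
    """Degree of plausibility: 1 - dsp of the negated hypothesis."""
    return (1.0 - dqs(qs_not_h, probs)) / (1.0 - dqs(qs_bot, probs))

# Scenario sets of Example 2 for h = (not in -> out); probabilities are hypothetical.
probs = (0.9, 0.9, 0.9)
qs_h = {(0, 1, 1), (1, 0, 1), (1, 1, 1)}
qs_bot = set()        # no inconsistent scenarios
qs_not_h = set()      # no scenario quasi-supports the negation of h

support = dsp(qs_h, qs_bot, probs)
plausibility = dpl(qs_not_h, qs_bot, probs)
print(round(support, 6), plausibility)
```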


3.4  Computational  Theory 

Computing  quasi-supports  is  the  basic  operation  in  PAS.  It 
can  be  based  on  resolution  and  variable  elimination  (forget¬ 
ting)  [15,  12,  13].  In  the  sequel,  we  will  sketch  some  of  the 
main  concepts  for  computation. 

First, note that the computation of $qs(h)$ can be reduced to the computation of the conflicts with respect to an updated knowledge base: $qs(h, \Sigma) = qs(\bot, \Sigma \cup \{\neg h\})$. So for any hypothesis $h$, the quasi-supporting arguments $qs(h, \Sigma)$ can be determined by computing the conflicts with respect to the knowledge base $\Sigma \cup \{\neg h\}$. Hence in the sequel, we focus on the computation of the conflicts with respect to a general knowledge base.

The ideas presented in the sequel are based on representations of knowledge in conjunctive normal form (CNF), i.e. a conjunction of clauses. The main step is based on the principle of resolution. Let $x \in A \cup P$. A disjoint decomposition of $\Sigma$ is then defined as follows:

$$\Sigma_x^{+} = \{\xi \in \Sigma : x \in Lit(\xi)\},$$

$$\Sigma_x^{-} = \{\xi \in \Sigma : \neg x \in Lit(\xi)\},$$

$$\Sigma_x^{0} = \{\xi \in \Sigma : x \notin Lit(\xi) \text{ and } \neg x \notin Lit(\xi)\}.$$

$Lit(\xi)$ denotes the set of all literals occurring in $\xi$. A literal is either a (positive) atom or a negated atom.

Consider two clauses $\xi^{+} = x \vee \delta^{+}$ and $\xi^{-} = \neg x \vee \delta^{-}$ in $\Sigma_x^{+}$ and $\Sigma_x^{-}$ respectively. The clause $\rho(\xi^{+}, \xi^{-}) = \delta^{+} \vee \delta^{-}$ is called the resolvent; note that we implicitly simplify the resolvent so that $\rho(\xi^{+}, \xi^{-})$ is again a clause, i.e. double occurrences of atoms etc. are simplified.

Eliminating a variable $x \in P \cup A$ from $\Sigma$ means now to compute

$$Elim_x(\Sigma) = \mu\left(\Sigma_x^{0} \cup \{\rho(\xi^{+}, \xi^{-}) : \xi^{+} \in \Sigma_x^{+},\ \xi^{-} \in \Sigma_x^{-}\}\right).$$

Consider a set $Q \subseteq P \cup A$. We define now, for $Q = \{q_1, \ldots, q_r\}$,

$$Elim_Q(\Sigma) = Elim_{q_r}(\ldots(Elim_{q_2}(Elim_{q_1}(\Sigma)))\ldots).$$

The result does not depend on the order in which the atoms are eliminated; yet note that the computations depend critically on a “good” ordering, see [15] for a discussion as well as relations to the theory of local computation (in the sense of Shenoy & Shafer [25]).
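The elimination step can be sketched as follows. Clauses are represented as frozensets of (atom, sign) pairs; tautological resolvents are dropped and subsumed clauses are removed, which is what the function `minimize` is assumed to do here. The CNF of the knowledge base of Example 1 together with the negated hypothesis $\neg(\neg in \to out)$ was converted by hand for this illustration, and the elimination order is arbitrary.

```python
def resolve(c_pos, c_neg, x):
    """Resolvent of c_pos (contains x) and c_neg (contains not x); None if tautological."""
    r = (c_pos - {(x, True)}) | (c_neg - {(x, False)})
    if any((atom, not sign) in r for atom, sign in r):
        return None
    return frozenset(r)

def minimize(clauses):
    """Remove every clause that is subsumed by a strictly smaller clause."""
    return {c for c in clauses if not any(d < c for d in clauses)}

def eliminate(clauses, x):
    """Elim_x: keep clauses without x and add all resolvents on x."""
    pos = [c for c in clauses if (x, True) in c]
    neg = [c for c in clauses if (x, False) in c]
    rest = {c for c in clauses if (x, True) not in c and (x, False) not in c}
    resolvents = {resolve(p, n, x) for p in pos for n in neg}
    resolvents.discard(None)
    return minimize(rest | resolvents)

def eliminate_all(clauses, atoms):
    for x in atoms:
        clauses = eliminate(clauses, x)
    return clauses

def clause(*literals):
    """Build a clause from strings; a leading '-' marks a negated atom."""
    return frozenset((l.lstrip("-"), not l.startswith("-")) for l in literals)

kb = {
    clause("-okA", "-in", "-x1"), clause("-okA", "in", "x1"), clause("okA", "-x1"),
    clause("-okB", "-in", "-x2"), clause("-okB", "in", "x2"), clause("okB", "-x2"),
    clause("-okC", "-out", "x1", "x2"), clause("-okC", "out", "-x1"),
    clause("-okC", "out", "-x2"), clause("okC", "-out"),
    clause("-in"), clause("-out"),  # negated hypothesis: in is false and out is false
}
projected = eliminate_all(kb, ["in", "x1", "x2", "out"])
print(sorted(sorted(c) for c in projected))
```

The remaining clauses mention assumptions only. Their conjunction is equivalent to $\neg((ok_A \wedge ok_C) \vee (ok_B \wedge ok_C))$, so its negation agrees with the quasi-support found in Example 2.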

This then allows the quasi-supporting arguments of a knowledge base $\Sigma$ to be computed as follows:

Theorem 1 ([15])

$$QS_A(h, \Sigma) = N_A^c(Elim_P(\Sigma \cup \{\neg h\}))$$


In other words, this theorem asserts that

$$qs(h, \Sigma) = \neg \bigwedge Elim_P(\Sigma \cup \{\neg h\}).$$

The concept of elimination allows quasi-supporting, and therefore also supporting as well as possible, arguments for hypotheses to be computed. This notation connects the concepts presented here to the more general theory of valuation algebras, a general theory for representing, combining and focusing pieces of information [18, 21].

4 Reliability Analysis Using Probabilistic Argumentation Systems

4.1 Reliability based on Requirement Specification

We discuss now how probabilistic argumentation systems can be used to formulate and solve reliability problems. The basic idea is simple: the system behavior is described in terms of the states of its components. In addition, the desired or required behavior of the system is specified. The system description forms a probabilistic argumentation system. The question is then: how likely (probable) is it that the specified requirement is satisfied? In order to answer this question, the specification of required behavior is taken as a hypothesis. The support of this specification then essentially determines the structure function of this reliability problem, and the degree of support of the specified requirement is the reliability of the system with respect to the required behavior. Note that, depending on the different goals a system should attain or the services it should provide, different requirements may be formulated. So the corresponding reliability analysis has to be differentiated, but can be carried out within the same framework of probabilistic argumentation systems.

Example 3: (Cont. of Example 1)

We have already formulated $\Sigma$ and two different specifications $\delta_1 = \neg in \to out$ and $\delta_2 = \neg in \leftrightarrow out$. We can compute the supports of these two specifications. It turns out that both are the same,

$$sp(\delta_1, \Sigma) = sp(\delta_2, \Sigma) = (ok_A \wedge ok_C) \vee (ok_B \wedge ok_C).$$

Note that this is just the path representation of the expected structure function. In fact this structure function could be reformulated as $(ok_A \vee ok_B) \wedge ok_C$, which shows that it is a series system composed of component C and a parallel module of the components A and B. The remarkable fact is that this structure function has been automatically deduced from the system description and the specification of requirements.

The system description is an essential element for this analysis. If it is changed, then this may influence the results of the analysis. Suppose that, in contrast to the model above, we do not know how the faulty components behave. The knowledge base becomes now

$$\Sigma' = \{\, ok_A \to (in \leftrightarrow \neg x_1),\ ok_B \to (in \leftrightarrow \neg x_2),\ ok_C \to (out \leftrightarrow x_1 \vee x_2) \,\}.$$

With this less complete model, the structure functions of the two specifications above become different,

$$sp(\delta_1, \Sigma') = (ok_A \wedge ok_C) \vee (ok_B \wedge ok_C),$$

$$sp(\delta_2, \Sigma') = ok_A \wedge ok_B \wedge ok_C.$$

Now, the stronger requirement $\delta_2$ can only be guaranteed if all three components work correctly (a series system), whereas the weaker one still has the same redundancy as before. ©

In the general case, we have a PAS $(\Sigma, A, P)$, where the assumable symbols in $A$ correspond to the components of the system. Positive assumptions correspond to working components. Accordingly, in the context of reliability analysis, we shall call the scenarios of this argumentation system system states. The propositional symbols in $P$ are needed to describe the system behavior. We assume that the system description $\Sigma$ excludes no system states, that is, there are no conflicts, $QS_A(\bot, \Sigma) = \emptyset$. A knowledge base $\Sigma$ which satisfies this is called A-consistent.

The required behavior is specified by a formula $\delta$. Usually $\delta$ will not contain assumptions, but there is no reason to exclude this in general. $\delta$ formulates a reliability goal. There may be several such goals.

The set of system states $SP_A(\delta, \Sigma)$ supporting $\delta$ contains all states guaranteeing the required specification from the system description. Its complement $SP_A^c(\delta, \Sigma) = PL_A(\neg\delta, \Sigma)$ contains the system states where this guarantee is no longer assured. These are the unreliable states. So $SP_A(\delta, \Sigma)$ defines the structure function associated with the specification $\delta$,

$$s = \phi_{\delta,\Sigma}(\mathbf{s}) = \begin{cases} \top & \text{if } \mathbf{s} \in SP_A(\delta, \Sigma), \\ \bot & \text{if } \mathbf{s} \notin SP_A(\delta, \Sigma). \end{cases}$$

The index $\Sigma$ in $\phi_{\delta,\Sigma}$ will be omitted if $\Sigma$ is clear from the context. Here, $s$ denotes the “system state”, which is $\top$ when the reliability specification is assured and $\bot$ otherwise. Since the set $SP_A(\delta, \Sigma)$ has a logical representation based on minimal arguments, the same holds for the structure function $\phi_\delta$,

$$\phi_\delta = \bigvee_{\alpha \in \mu SP(\delta, \Sigma)} \alpha = sp(\delta, \Sigma). \qquad (7)$$

In the same way, based on minimal possible arguments $\mu PL(\neg\delta, \Sigma)$, we obtain

$$\neg\phi_\delta = \bigvee_{\beta \in \mu PL(\neg\delta, \Sigma)} \beta = pl(\neg\delta, \Sigma).$$


By de Morgan laws this transforms into

$$\phi_\delta = \bigwedge_{\beta \in \mu PL(\neg\delta, \Sigma)} \neg\beta. \qquad (8)$$

Note that $\neg\beta$, the negation of a term, is a clause. This is a second logical representation of $\phi_\delta$.

A comparison with the minimal path and minimal cut representation of monotone structure functions (2) shows that minimal supporting arguments $\alpha$ for $\delta$ and minimal possible arguments $\beta$ for $\neg\delta$ play a role similar to minimal paths and minimal cuts.

According to our assumption of A-consistency, we have $QS_A(\bot, \Sigma) = \emptyset$. Thus

$$SP_A(\delta, \Sigma) = QS_A(\bot, \Sigma \cup \{\neg\delta\}). \qquad (9)$$

On the other hand, we have also

$$PL_A(\neg\delta, \Sigma) = QS_A^c(\bot, \Sigma \cup \{\neg\delta\}). \qquad (10)$$

This shows that a reliability analysis of a system $\Sigma$ relative to a requirement specification $\delta$ essentially requires the computation of the conflict states $QS_A(\bot, \Sigma \cup \{\neg\delta\})$. We shall see below that this is exactly what is also required for diagnosis. This is a first hint to the duality between the problems of reliability and diagnosis.

Once probabilities for the assumptions, i.e. component availabilities or reliabilities, are defined, system reliability relative to a specification $\delta$ is simply the degree of support of $\delta$ (since $QS_A(\bot, \Sigma) = \emptyset$), i.e.

$$p_{\delta,\Sigma} = dsp(\delta, \Sigma) = dqs(\delta, \Sigma) = p(QS_A(\bot, \Sigma \cup \{\neg\delta\})).$$
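The dependence of the structure function on both the system description and the specification can be reproduced for the detector of Example 1 by brute-force enumeration. In the sketch below, the knowledge base with fault models and the one without are encoded as Python predicates, and the component probabilities of 0.9 are hypothetical; the names are chosen for this illustration only.

```python
from itertools import product

A = ("okA", "okB", "okC")
P = ("in", "x1", "x2", "out")

def working(v):
    """Behavior of correctly working components (knowledge base without fault models)."""
    return ((not v["okA"] or v["in"] != v["x1"])
            and (not v["okB"] or v["in"] != v["x2"])
            and (not v["okC"] or v["out"] == (v["x1"] or v["x2"])))

def sigma(v):
    """Full model: a faulty component additionally has output false."""
    return (working(v) and (v["okA"] or not v["x1"])
            and (v["okB"] or not v["x2"]) and (v["okC"] or not v["out"]))

def delta1(v):
    return v["in"] or v["out"]      # not in -> out

def delta2(v):
    return v["in"] != v["out"]      # not in <-> out

def supporting(kb, spec):
    """System states that are consistent with kb and guarantee spec."""
    result = set()
    for s in product([False, True], repeat=len(A)):
        models = [dict(zip(A + P, s + x)) for x in product([False, True], repeat=len(P))]
        models = [v for v in models if kb(v)]
        if models and all(spec(v) for v in models):
            result.add(s)
    return result

def reliability(states, probs):
    """Degree of support: total probability of the supporting system states."""
    total = 0.0
    for s in states:
        weight = 1.0
        for s_i, p_i in zip(s, probs):
            weight *= p_i if s_i else 1.0 - p_i
        total += weight
    return total

probs = (0.9, 0.9, 0.9)
r1 = reliability(supporting(working, delta1), probs)
r2 = reliability(supporting(working, delta2), probs)
print(round(r1, 6), round(r2, 6))
```

With the fault models included, both specifications have the same three supporting states; without them, the stronger specification is supported only when all three components work.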

4.2  Implicitly  Defined  Reliability 

A specification $\delta$ is called consistent with the system description $\Sigma$ if the system state $\mathbf{1} = (1, \ldots, 1)$ belongs to $SP_A(\delta, \Sigma)$. In this section we only consider specifications consistent with the system description.

A system description $\Sigma$ often contains, besides assumptions, another set $O$ of special propositional atoms, namely those which are observable. Then specifications $\delta$ can be assumed to be formulated with observables only, $\delta \in \mathcal{L}_O$. Observables are typically input and output variables of some system.

Assume now that in a system description $(\Sigma, P, A)$ a set of observable variables $O$ is singled out. Usually, $O \subseteq P$, i.e. component states cannot be observed directly. But it does no harm to assume more generally $O \subseteq P \cup A$. Then we define an implicit specification

$$\hat{\delta} = Elim_{(A \cup P) - O}(\Sigma \cup \{a_1 \wedge a_2 \wedge \cdots \wedge a_n\}).$$

That is, $\hat{\delta}$ represents all the functionality of the system in terms of observables which can be obtained from a system with all components working. We call this the implicit reliability specification with respect to $O$. Now, with respect to this specification, the system may be as good as “new” also for some states including faulty components. Therefore we define the implicit structure function by the set of up-states relative to $\hat{\delta}$, i.e. $SP_A(\hat{\delta}, \Sigma)$. Hence, we obtain

$$\phi_{\hat{\delta}} = \bigvee_{\alpha \in \mu SP(\hat{\delta}, \Sigma)} \alpha, \quad \text{or} \quad \phi_{\hat{\delta}} = \bigwedge_{\beta \in \mu PL(\neg\hat{\delta}, \Sigma)} \neg\beta.$$

Accordingly, the implicit reliability of such a system can be obtained as the degree of support $dsp(\hat{\delta}, \Sigma)$. This approach helps to decide whether a system has some implicit redundancy, namely, whether $\phi_{\hat{\delta}}$ represents simply a series system, i.e. $\mu SP(\hat{\delta}, \Sigma)$ has only the set of all assumptions as minimal supporting argument for $\hat{\delta}$.

Lemma 2 If $\delta \in \mathcal{L}_O$ is a consistent specification with respect to $\Sigma$, then $\hat{\delta} \models \delta$.¹

This shows that $\hat{\delta}$ is the most stringent consistent specification over observables $O$. For all specifications over $O$ the implicit specification has least reliability:

Lemma 3 If $\delta \in \mathcal{L}_O$ is a consistent specification with respect to $\Sigma$, then $SP_A(\hat{\delta}, \Sigma) \subseteq SP_A(\delta, \Sigma)$.

Corollary 4 If $\delta \in \mathcal{L}_O$ is a consistent specification with respect to $\Sigma$, then $p_{\hat{\delta}} \le p_\delta$.

¹ For proofs see [6].
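For small systems the implicit specification can be obtained semantically, by projecting the models of the knowledge base with all assumptions true onto the observables, instead of by resolution-based elimination. The sketch below does this for the detector of Example 1 with the assumed set of observables $O = \{in, out\}$; the function names are illustrative and not taken from the paper.

```python
from itertools import product

ASSUMPTIONS = ("okA", "okB", "okC")
PROPOSITIONS = ("in", "x1", "x2", "out")

def sigma(v):
    """Knowledge base of Example 1 (six implications) as a predicate."""
    return ((not v["okA"] or v["in"] != v["x1"]) and (v["okA"] or not v["x1"])
            and (not v["okB"] or v["in"] != v["x2"]) and (v["okB"] or not v["x2"])
            and (not v["okC"] or v["out"] == (v["x1"] or v["x2"]))
            and (v["okC"] or not v["out"]))

def implicit_spec_models(kb, observables):
    """Models over the observables of kb with all components working."""
    models = set()
    for x in product([False, True], repeat=len(PROPOSITIONS)):
        v = dict(zip(PROPOSITIONS, x))
        v.update({a: True for a in ASSUMPTIONS})
        if kb(v):
            models.add(tuple(v[o] for o in observables))
    return models

# Models listed as (in, out) pairs.
models = implicit_spec_models(sigma, ("in", "out"))
print(sorted(models))
```

The two resulting models are exactly those of $\neg in \leftrightarrow out$, so for this device the implicit specification coincides with the more stringent specification of Example 1.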

5  Model-Based  Diagnostic 

5.1  Duality  Between  Reliability  and 
Diagnostics 

A problem of diagnostics arises if an observation indicates that a requirement specification $\delta$ is violated. Then the question is: how can the required functionality be recovered? That is, one would like to find out those components whose failure caused the system failure and which have to be fixed or replaced. This analysis will be based on the system description $\Sigma$ and on the specification $\delta$ which is violated.

In fact, we ask which system states are compatible or consistent with the system description $\Sigma$ and the violation of the specification $\delta$, expressed by $\neg\delta$. These are of course all states which are consistent with $\Sigma \cup \{\neg\delta\}$, that is the set

$$QS_A^c(\bot, \Sigma \cup \{\neg\delta\}) = PL_A(\neg\delta, \Sigma). \qquad (11)$$

Note that this is exactly the set of down states relative to the specification $\delta$ (see (10)). Here we have the basic duality between reliability analysis relative to a requirement specification $\delta$ and the diagnostic problem relative to the same specification. The conflict set $QS_A(\bot, \Sigma \cup \{\neg\delta\})$ is the computational key to both reliability analysis and diagnostics. It gives the up-states which define reliability, and its complement gives the possible states explaining the violation of the reliability specification, i.e. possible diagnostics. It is well known in model-based diagnostics that such conflict sets play a key role [23, 10, 19]. The duality implies that they play an equally important role for model-based reliability.

If the structure function $\phi_{\delta,\Sigma}$ is monotone, then to the minimal possible arguments $\beta \in \mu PL(\neg\delta, \Sigma)$ correspond the minimal cuts $\neg\beta$. They represent minimal sets of failed components which explain the violation of the specification $\delta$, independently of the state of the other components.

Minimal cuts correspond to kernel diagnoses in model-based
diagnostics [23]. Usually model-based diagnostics does not go
beyond such concepts of diagnostics. It neglects the
important role of probability.² The observation of the violation
of the specification, $\neg\delta$, in fact defines the event
$QS_A^c(\bot, \Sigma \cup \{\neg\delta\})$ in the sample space $N_A$. That is, the
prior probabilities $p(s)$ defined on the states now have to be
conditioned on this event. This gives us the posterior probabilities

$$p(s \mid \neg\delta) = \frac{p(s)}{1 - P(QS_A(\bot, \Sigma \cup \{\neg\delta\}))} = \frac{p(s)}{dpl(\neg\delta, \Sigma)} \tag{12}$$

for diagnostic states $s \in QS_A^c(\bot, \Sigma \cup \{\neg\delta\})$. This underlines
once more the key role of the conflict set $QS_A(\bot, \Sigma \cup \{\neg\delta\})$. Its
prior probability is sufficient to compute the posterior
probabilities of the possible diagnostic states explaining the
violation of $\delta$.

² See however [19, 3] for a discussion of this subject, and
especially [19] for the problems of the approach of De Kleer &
Williams [11]. Other approaches focus for example on minimal
entropy [26] or on restricting the device to have a Bayesian
network model [17].


These posterior probabilities represent important additional
diagnostic information. For example, we may look for
diagnostic states with maximal posterior probability. A state $s$
is called a maximum likelihood state if

$$p(s \mid \neg\delta) = \max_{s' \in QS_A^c(\bot, \Sigma \cup \{\neg\delta\})} p(s' \mid \neg\delta). \tag{13}$$

There may be several such states. They represent the most likely
states explaining the violation of $\delta$.

Reiter [23] proposed to look especially at possible diagnostic
states with a minimal number of faulty components.
Intuitively this makes sense: the failure should be explained
with a minimal number of down components. If $s$ is a state,
we define $s^-$ to be the set of its negative (down)
components. Then we define a partial order between states: $s' < s$
if $s'^- \subset s^-$. Reiter diagnoses are now those diagnostic states
$s \in QS_A^c(\bot, \Sigma \cup \{\neg\delta\})$ which are minimal with respect to
this partial order. We make the reasonable assumption that
for every component $i$ we have $p_i > 0.5$, so that $p_i > 1 - p_i$,
i.e. it is more probable that a component works than that
it is down. Then $s' < s$ implies that $p(s' \mid \neg\delta) > p(s \mid \neg\delta)$. So
maximum likelihood states are Reiter diagnoses. The converse
of course does not necessarily hold. Also, if the structure
function $\varphi_\delta$ is monotone, the $s^-$ of Reiter diagnoses correspond to
minimal cuts relative to the specification $\delta$.
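
The step from the partial order to the probabilities can be spelled out as follows. This is only a sketch under the assumption, used in the worked example of this section but not restated here, that components fail independently, so that the prior of a state is a product of component probabilities. For two diagnostic states with $s'^- \subset s^-$:

$$p(s) = \prod_{i \notin s^-} p_i \prod_{i \in s^-} (1 - p_i), \qquad \frac{p(s')}{p(s)} = \prod_{i \in s^- \setminus s'^-} \frac{p_i}{1 - p_i} > 1,$$

because every factor exceeds 1 when $p_i > 0.5$. Both posteriors in (12) are obtained by dividing the priors by the same constant, so the inequality carries over to $p(s' \mid \neg\delta) > p(s \mid \neg\delta)$.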

The posterior fault probabilities of the components,
$p(\neg a_i \mid \neg\delta)$, are also of interest. The larger this probability,
the more critical component $i$ is for the requirement
specification $\delta$. So this is a possible importance measure for
component $i$ relative to the specification (for other importance
measures see [4]).

Example 4 (Cont. of Example 1)

Suppose we observe that, although $\neg in$, we also have $\neg out$,
i.e. a power system failure is not detected. Note that
$\neg in \wedge \neg out = \neg\delta_1$ (cf. Example 3). So we consult the minimal cuts
relative to the specification $\delta_1$. There are two minimal cuts:
$\{\neg ok_A, \neg ok_B\}$ and $\{\neg ok_C\}$. To each minimal cut corresponds
a Reiter diagnosis, namely $\{ok_A, ok_B, \neg ok_C\}$ to the cut $\{\neg ok_C\}$
and $\{\neg ok_A, \neg ok_B, ok_C\}$ to the cut $\{\neg ok_A, \neg ok_B\}$. One of these two
diagnoses must be the maximum likelihood state. The first diagnosis
has prior probability $0.99 \times 0.99 \times 0.05 = 0.049$, the second one
$0.01 \times 0.01 \times 0.95 = 0.000095$. So clearly, the first one is by far
the most likely state. The posterior probability is obtained by
dividing the prior probability by the unreliability 0.05 relative
to $\delta_1$. We obtain for the maximum likelihood state a posterior
probability of 0.98. $\ominus$
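
The arithmetic of this example can be checked by brute-force enumeration of the eight component states. The following is a minimal sketch, not the computational method of the paper: it assumes independent components with the working probabilities used in the example (0.99, 0.99, 0.95), and it assumes a monotone structure, so that a state violates $\delta_1$ exactly when its set of down components contains one of the two minimal cuts. All function and variable names are illustrative.

```python
from itertools import product

# Prior probabilities that each component works (values taken from the example).
P_UP = {"A": 0.99, "B": 0.99, "C": 0.95}
NAMES = sorted(P_UP)

def violates(down):
    """Assumed structure: delta_1 is violated iff `down` contains a minimal cut."""
    return {"C"} <= down or {"A", "B"} <= down

def prior(down):
    """Prior probability of the state whose down components are `down`."""
    p = 1.0
    for c in NAMES:
        p *= (1 - P_UP[c]) if c in down else P_UP[c]
    return p

# Every state is represented by the frozenset of its down components.
states = [frozenset(c for c, up in zip(NAMES, bits) if not up)
          for bits in product([True, False], repeat=len(NAMES))]
diagnostic = [s for s in states if violates(s)]

# Equation (12): condition the priors on the set of diagnostic states.
unreliability = sum(prior(s) for s in diagnostic)
posterior = {s: prior(s) / unreliability for s in diagnostic}

# Equation (13), Reiter diagnoses (subset-minimal down sets), component fault probabilities.
ml_state = max(posterior, key=posterior.get)
reiter = {s for s in diagnostic if not any(t < s for t in diagnostic)}
fault_prob = {c: sum(p for s, p in posterior.items() if c in s) for c in NAMES}

print(sorted(ml_state), round(posterior[ml_state], 3))
print(sorted(sorted(s) for s in reiter))
print({c: round(p, 4) for c, p in fault_prob.items()})
```

Under these assumptions the exact unreliability is 0.050095 rather than the rounded 0.05, and the posterior of the maximum likelihood state (only C down) is about 0.978, in line with the 0.98 quoted in the text.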

5.2  Diagnostics  Based  on  Observations  of 
System  Behavior 

The actual observation is not necessarily the negation of a
system requirement, but may be something stronger, which
implies the violation of a specification. Indeed, as we saw in
Example 4, we observed $\neg in \wedge \neg out = \neg\delta_1$, but $\neg in \wedge \neg out \models \neg\delta_0$.
So, we should reconsider the duality between reliability and
diagnostics. In fact, assume that we make some observation
of the system behavior, expressed in a formula $\omega$ over
observables. Then we may test whether $\omega \models \neg\delta_0$. If this is the case,
then we have a diagnostic problem, in the sense that at least
one component must be down.


The solution of this diagnostic problem is found along
similar lines as in the previous section. Possible states are those
which are consistent with the system description and the
observation. In other words, the states in the conflict set
$QS_A(\bot, \Sigma \cup \{\omega\})$ are those which are excluded by the
observation. So, the possible diagnostic states are those of the set
$PL_A(\omega, \Sigma) = QS_A^c(\bot, \Sigma \cup \{\omega\})$. We see that this diagnostic
problem is dual to a (fictitious) reliability problem with a
“requirement” specification $\neg\omega$. Note that the specification $\neg\omega$ is
always consistent with $\Sigma$, since $\delta_0$ is consistent and $\omega \models \neg\delta_0$.

Of course, we get a much sharper diagnostic with an
observation $\omega \models \neg\delta$ than with the information of $\neg\delta$ only. This
follows because, according to Lemma 3, we have
$PL_A(\omega, \Sigma) \subseteq PL_A(\neg\delta, \Sigma)$. So, the more precise the observation,
the more states are eliminated. A mere statement that a given
reliability specification is violated is less informative than a precise
observation implying a violation of a requirement specification.

6  Combining  Diagnostic  and  Reliability 

We conclude this discussion of the duality between reliability and
diagnostics by remarking that we may have an observation of
the system behavior which entails neither a specification $\delta$
nor its violation $\neg\delta$. Still, this observation is information,
and we can use it to improve reliability analysis and also to
perform a preventive diagnostic analysis (see [6]). For
reliability as well as for diagnostics, additional measurements
(or, more generally, any additional information) can be taken
into account in the framework presented above and help to
focus the reasoning.

7  Conclusions 

In this paper we have shown how closely reliability and
model-based diagnostics are connected. The framework of
probabilistic argumentation systems (PAS) covers both approaches.
Therefore the generic structure of PAS can be used for solving
problems in both domains. The approaches can even be combined,
and the information specified can be used in the common
framework. Further, from the system description of an
argumentation system, we can derive the appropriate structure
function and, if desirable, take into consideration a reliability
requirement. PAS allow the use of local computation architectures
and approximation techniques [25, 15]. This complements the
computational theory of reliability.
Abstract. Active systems are a class of discrete-event systems
modeled  as  networks  of  nondeterministic  automata  communicating 
through  either  synchronous  or  asynchronous  connection  links.  The 
model-based  diagnosis  of  an  active  system  is  carried  out  by  first 
reconstructing  its  behavior  based  on  the  observation,  from  which 
faults  are  later  derived.  The  complexity  of  behavior  reconstruction  is 
exacerbated  by  the  possibility  of  queuing  events  within  links,  thereby 
making  essential  the  simulation  of  the  order  in  which  events  are 
buffered  within  links.  Unfortunately  some  sequences  of  events  may 
lead  to  blind  alleys  in  the  search  space.  This  is  especially  critical  if 
events  exchanged  among  components  are  assumed  to  be  uncertain, 
as  the  number  of  alternative  sequences  of  queued  events  is  still 
larger.  Therefore,  behavior  reconstruction  without  any  prospection 
in  the  search  space  is  generally  bound  to  detrimental  backtracking. 
To  make  diagnosis  of  active  systems  more  efficient,  we  present  an 
off-line  technique  for  processing  the  models  inherent  to  the  system  at 
hand  so  as  to  automatically  generate  prospection  knowledge  relevant 
to  the  mode  in  which  events  are  produced  and  consumed  over  links. 
Such knowledge is then exploited on-line, when the diagnostic
engine is running, to guide the search process, thus reducing both
time and space.

1  INTRODUCTION 

Diagnosis  of  discrete-event  systems  (DESs)  is  a  complex  and  chal¬ 
lenging  task  that  has  been  receiving  an  increasing  interest  from  both 
the  model-based  diagnosis  community  [9],  within  the  AI  area,  and 
the  fault  detection  and  isolation  (FDI)  community  [16,  8,  10],  within 
the  automatic  control  area.  The  current  shared  prospect  is  that,  in 
the  general  case,  the  specific  faults  cannot  be  inferred  without  first 
finding  out  what  has  happened  to  the  system  to  be  diagnosed.  Once 
the  system  evolution  is  available,  the  sets  of  candidate  faults  can  be 
derived  from  it. 

In this respect, in spite of slightly different terminologies, such
as histories [2], situation histories or narratives [4], paths [5], and
trajectories [11, 6], all the distinct approaches describe the evolution
of a DES as a sequence interleaving states and transitions, as the
favorite behavioral models of DESs in the literature are automata.

Based  on  the  method  for  tracking  the  evolutions  of  the  system  that 
explain  a  given  observation,  two  broad  categories  of  approaches  to 
diagnosis  of  DESs  can  be  basically  singled  out: 

•  Those  that  first  generate  (a  concise/partial  model  of)  all  possible 
evolutions  and  then  retrieve  only  the  evolutions  that  explain  the 
observation; 

•  Those  that  generate  in  one  shot  the  evolutions  explaining  the 
observation. 

The  first  category  includes  some  relevant  works  from  both  the 
automatic  control  area  [19,  20,  7,  15]  and  the  AI  area  [12,  6]. 
Embodied in the second category are some approaches of the AI
area [2, 11, 17].

Since finding out the system evolutions is a computationally
expensive and, therefore, inefficient process (see, for instance, [18]
about the computational difficulties of the diagnoser approach
[19, 20], or the worst case computational complexity analysis in
[2], or the discussion in [11]), most of the approaches exploit a
trade-off between off-line and on-line computation.

Focusing  on  the  second  category  outlined  above,  the  decentral¬ 
ized  diagnoser  approach  [17]  draws  off-line  a  local  diagnoser  for 
each  component.  Such  a  diagnoser  is  an  automaton  whose  states 
and  (observable)  transitions  are  labeled  with  compiled  knowledge 
about  unobservable  paths  and  interacting  components,  respectively. 
Each  local  diagnoser  is  employed  on-line  for  both  a  more  efficient 
reconstruction  of  all  the  possible  evolutions  of  the  relevant  compo¬ 
nent  that  comply  with  the  observation  and  a  more  efficient  merging 
of  the  histories  of  distinct  components  into  global  system  histories. 

This  paper  applies  knowledge  compilation  to  the  active  system 
approach  [2,  3],  to  which  purpose  it  isolates  a  kind  of  knowledge, 
implicit  in  the  models  of  the  structure  and  behavior  of  the  system 
at  hand,  that  can  be  compiled  off-line  in  order  to  speed  up  on-line 
execution.  The  framework  is  that  of  active  systems,  a  class  of  DESs 
modeled  as  networks  of  nondeterministic  automata  communicating 
through  directed  links.  If  an  active  system  includes  one  or  more 
asynchronous  buffered  links,  its  reaction  to  an  event  coming  from 
the  external  world  is  assumed  to  continue  until  there  is  no  event 
left  in  the  links.  The  component  that  sends  events  on  a  link  is  the 
event  producer  and  that  extracting  them  from  the  link  is  the  con¬ 
sumer.  The  knowledge  we  compile  is  actually  that  inherent  to  the 
producer-consumer  relationships  between  components.  In  particular, 
we  present,  by  means  of  an  example: 

•  An  extension  of  both  the  modeling  primitives  and  the  on-line 
‘short-sighted’  evolution  reconstruction  method  so  as  to  cope  with 
uncertain  events; 

•  A method for generating off-line, in the form of a deterministic
automaton called a prospection graph, the model of the way
events are exchanged over one or more links;

•  A  ‘far-sighted’  method  for  exploiting  prospection  graphs  on-line 
while  reconstructing  the  evolutions  of  (sub)systems. 

Finally, the computational advantages of far-sighted diagnosis are
discussed and some conclusions are drawn.

2  ACTIVE  SYSTEMS  WITH  UNCERTAIN 
EVENTS 

Topologically, an active system $\Sigma$ is a network of components which
are connected to one another through links. Each component is
completely modeled by an automaton which reacts to events coming
either from the external world or from neighboring components through
links. Formally, the automaton is a 6-tuple

$$(\mathbb{S}, \mathbb{E}_{in}, \mathbb{I}, \mathbb{E}_{out}, \mathbb{O}, \mathbb{T})$$

where $\mathbb{S}$ is the set of states, $\mathbb{E}_{in}$ the set of input events, $\mathbb{I}$ the set of
input terminals, $\mathbb{E}_{out}$ the set of output events, $\mathbb{O}$ the set of output
terminals, and $\mathbb{T}$ the (nondeterministic) transition function:

$$\mathbb{T} : \mathbb{S} \times \mathbb{E}_{in} \times \mathbb{I} \times 2^{\mathbb{E}_{out} \times \mathbb{O}} \mapsto 2^{\mathbb{S}}.$$

A transition from state $S$ to state $S'$, which is triggered by the
input event $\alpha = (E, I)$, $E \in \mathbb{E}_{in}$, $I \in \mathbb{I}$, and generates the set
$\beta = \{(E_1, O_1), \ldots, (E_n, O_n)\}$ of output events, $E_k \in \mathbb{E}_{out}$, $O_k \in \mathbb{O}$,
$k \in [1 .. n]$, is denoted by

$$S \xrightarrow{\alpha \mid \beta} S'.$$

Components are implicitly equipped with three virtual terminals:
the standard input ($In \in \mathbb{I}$) for events coming from the external
world, the standard output ($Out \in \mathbb{O}$) for events directed toward
the external world (messages), and the fault terminal ($Flt \in \mathbb{O}$) for
modeling faulty transitions.

An event $(E, Flt)$ is a fault event. The approach assumes that
both nominal and faulty behavior of each component are specified in
the automaton. A fault event is not exchanged among components.
Rather, it is a formal artifice to describe the faulty behavior of
components uniformly. The names of fault events are supposed to be
informative as to the specific fault affecting the component when
the relevant transition is performed.²

An  event  may  be  uncertain  in  nature,  that  is,  represented  by  a 
disjunction  of  possible  values.  Links  are  the  means  of  storing  the 
events  exchanged  between  components. 

Each link $L$ is characterized by a 4-tuple

$$(I, O, \chi, P)$$

where $I$ is the input terminal (connected with a component output
terminal), $O$ the output terminal (connected with a component input
terminal), $\chi$ the capacity, that is, the maximum number of queued
events, and $P$ the saturation policy, which dictates the effect of the
triggering of a transition $T$ attempting to insert a new event $E$ into
$L$ when $L$ is saturated, that is, when the length of the queue equals
$\chi$. Three cases are possible:

•  LOSE: $E$ is lost;

•  OVERRIDE: $E$ replaces the last event in the queue of $L$;

•  WAIT: $T$ cannot be triggered until $L$ becomes unsaturated, that
is, until at least one event in $L$ is consumed.

The queue domain $\mathbb{Q}$ of $L$ is the set of possible sequences
(queues) of events in $L$. The length of the queue $Q$ of events
incorporated in $L$ is denoted by $|Q|$.

The polymorphic Link function is defined as follows. Let

$$\alpha = (E, \theta)$$

represent an event relevant to a terminal $\theta$. Then,

$$Link(\alpha) \stackrel{\text{def}}{=} L \mid L \text{ is the link connected with } \theta.$$

No more than one link can be connected with a component terminal.
If $\theta$ is a virtual terminal, then $Link(\alpha) \stackrel{\text{def}}{=} null$. Let

$$\beta = \{(E_1, \theta_1), \ldots, (E_n, \theta_n)\}$$

² For example, consider a breaker which is in the state open and is expected
to change state to close when it receives a command (nominal behavior).
The possible misbehavior of the breaker can be defined by inserting a
faulty transition, from state open to open, that generates the fault event
(stuckToOpen, Flt).
Figure 1. System $\Psi$ and models of components $X$ (top) and $Y$ (bottom).


be a set of events relevant to terminals $\theta_i$, $i \in [1 .. n]$, respectively.
Then,

$$Link(\beta) \stackrel{\text{def}}{=} \{L_\xi \mid L_\xi = Link(\xi), \xi \in \beta\}.$$

Initially, $\Sigma$ is in a quiescent state $\Sigma_0$, wherein all links are empty. At
the arrival of an event from the external world, $\Sigma$ becomes reacting,
thereby making a series of transitions until a final quiescent state
is reached, wherein all links are empty anew. This reaction yields
a sequence of observable events, the messages, which make up a
system observation $OBS(\Sigma)$.

Let $\Sigma_0$ denote the initial state of system $\Sigma$. Based on a diagnostic
problem

$$\wp(\Sigma) = (OBS(\Sigma), \Sigma_0)$$

a reconstruction of the system reaction is carried out, which yields an
active space, that is, a graph representing the whole set of candidate
histories, each history being a path from $\Sigma_0$ to a final state, in
other terms, a sequence of component transitions which explains
$OBS(\Sigma)$.

Candidate  diagnoses  are  eventually  distilled  from  the  active  space, 
each  diagnosis  being  a  set  of  faulty  components,  that  is,  those  com¬ 
ponents  which  made  at  least  one  faulty  transition  during  a  candidate 
system  history. 

Example 1. Displayed in the center of Figure 1 is a system $\Psi$,
where $X$ and $Y$ are components, while $L_1$ and $L_2$ are links. Both
components are endowed with an input terminal $I$ and an output
terminal $O$. For both links we assume $\chi = 1$ and $P = WAIT$. The
behavioral models of $X$ and $Y$ are displayed on the top and on the
bottom, respectively. Accordingly, $Y$ involves three states
($Y_1 \cdots Y_3$) and four transitions ($y_1 \cdots y_4$), one of which is faulty
(states and transitions are denoted by capital and small letters,
respectively). For instance, transition $y_4$ is triggered by the input event
$(e_5, I)$ and generates the set of output events $\{(e_2, O), (d, Out)\}$,
where the former is directed toward $X$ on link $L_2$, while the latter
is a message labeled $d$ ($y_4$ is said to be observable). Transition $y_2$
involves the input event $(\{e_1, e_3\}, I)$, meaning that $y_2$ may be
triggered by either $e_1$ or $e_3$. Considering the model of $X$, note that, when
triggered, transition $x_3$ generates the uncertain event $(\{e_3, e_5\}, O)$,
meaning that either $e_3$ or $e_5$ is randomly generated (no
assumption is made about the likelihood of event generation). Likewise, $x_4$
generates the uncertain event $(\{e_3, \epsilon\}, O)$, meaning that either $e_3$
or nothing is generated ($\epsilon$ denotes the null event). □


3  SHORT-SIGHTED  DIAGNOSIS 


The main task relevant to the resolution of a diagnostic problem
$\wp(\Sigma) = (OBS(\Sigma), \Sigma_0)$ is the reconstruction of the system reaction
to make up the relevant active space $Act(\wp(\Sigma))$. A node $N$ in the
search space is identified by three fields, $N = (\sigma, \Im, Q)$, where:

•  $\sigma = (S_1, \ldots, S_n)$ is the record of states of the system
components, each $S_i$, $i \in [1 .. n]$, being a state relevant to a component
$C_i$ in $\Sigma$ ($n$ is the number of components in $\Sigma$);

•  $\Im$ is the index of $OBS(\Sigma)$, that is, an integer ranging from 0 to
the number of messages (length) of $OBS(\Sigma)$, which implicitly
denotes the prefix of the observation composed of the first $\Im$
messages;

•  $Q = (Q_1, \ldots, Q_l)$ is the record of queues of the $l$ links in $\Sigma$.

Node $N$ is said to be final when $\Im$ equals the length of $OBS(\Sigma)$ and
all links are empty. The search for the nodes of the active space is
started at the initial node $N_0 = (\Sigma_0, 0, (\langle\rangle, \ldots, \langle\rangle))$, where all link
queues are empty. Each successor node of a given node is obtained
by applying a component transition that is consistent with both the
system topology and the observation. An applied transition is an
edge of the search space. When the reconstruction process is carried
out in one step (monolithically) without any prospection knowledge
(short-sightedly), it can be described by Algorithm 1, where nodes
and edges generated during the search are stored in variables $\mathcal{N}$ and
$\mathcal{E}$, respectively.

Algorithm 1. (Short-sighted Reconstruction)

1. $\mathcal{N} = \{N_0\}$; $\mathcal{E} = \emptyset$; ($N_0$ is unmarked)

2. Repeat Steps 3 through 5 until all nodes in $\mathcal{N}$ are marked;

3. Get an unmarked node $N = (\sigma, \Im, Q)$ in $\mathcal{N}$;

4. For each $i$ in $[1 .. n]$, for each transition $T$ within the model
of component $C_i$, if $T$ is triggerable, that is, if its triggering
event is available within the link and $T$ is consistent with both
$OBS(\Sigma)$ and the link policy (when $T$ generates output events
on non-virtual terminals), do the following steps:

(a) Create a node $(N' = (\sigma', \Im', Q')) := N$; ($N'$ is created as a
copy of $N$)

(b) $\sigma'[i]$ := the state reached by $T$;

(c) If $T$ is observable, then $\Im' := \Im' + 1$; (a message is generated)

(d) If the triggering event $E$ of $T$ is relevant to an internal link
$L_j$, then remove $E$ from $Q'[j]$;

(e) Insert the internal output events of $T$ into the relevant queues
in $Q'$;

(f) If $N' \notin \mathcal{N}$ then insert $N'$ into $\mathcal{N}$; ($N'$ is unmarked)

(g) Insert edge $N \xrightarrow{T} N'$ into $\mathcal{E}$;

5. Mark $N$;

6. Remove from $\mathcal{N}$ all the nodes and from $\mathcal{E}$ all the edges that are
not on a path from the initial state $N_0$ to a final state in $\mathcal{N}$.

The algorithm aims to make up all the nodes which are reachable
from the initial node under the given observation. To this end, it
considers, one at a time, all the nodes which have been reached already
(those in $\mathcal{N}$) and have not yet been processed (the unmarked ones).
For each of them, it attempts to find a transition that is triggerable
by a component in the corresponding state. If so, it generates the
target node $N'$ with the appropriate values $\sigma'$, $\Im'$, and $Q'$. If the
new node was not created already, it is inserted into $\mathcal{N}$ (note that two
nodes which differ in the $\Im$ field only have to be considered
different, as the mode in which messages have been generated differs).
The corresponding edge $N \xrightarrow{T} N'$ is inserted into $\mathcal{E}$ too. Finally,
when there are no more nodes to be processed (all nodes in $\mathcal{N}$ are
marked), the search space is pruned by eliminating the inconsistent
nodes, that is, those that are on a blind alley.

Figure 2. Short-sighted reconstruction space (see Example 2).

It is worthwhile highlighting that the search process does not
terminate at a final node. In fact, the system might continue to react
and loop on unobservable paths. In other words, when a final node
is met in the search, it is inserted into $\mathcal{N}$ as an unmarked node like
all other nodes, since in principle, unobservable paths might happen
to leave it.

When uncertain output events are involved, several new nodes
$N'$ are to be generated for the same transition $T$, specifically, one
for each combination of possible values within each disjunction.
For example, since transition $x_3$ in Figure 1 involves the uncertain
output event $(\{e_3, e_5\}, O)$, two target nodes will be generated, one
for $e_3$ and one for $e_5$. If the set of output events included several
uncertain events, all possible combinations would have to be
enumerated.
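
The reconstruction loop, including the enumeration of uncertain output events, can be sketched in a few lines of Python. This is an illustrative reading of Algorithm 1 and not the authors' implementation. Several simplifications are assumptions of the sketch: every link uses the WAIT policy with the same capacity, events on the standard input are taken to be always available, the dictionary layout of a transition (`name`, `comp`, `src`, `dst`, `trig`, `outs`, `msg`) is invented for the example, and the two-component toy system is hypothetical (it is not the system of Figure 1). The `diagnoses` helper collects, for every candidate history, the set of components that performed a faulty transition.

```python
from itertools import product

def reconstruct(transitions, n_links, obs, init_states, cap=1):
    """Short-sighted reconstruction; returns the pruned (nodes, edges)."""
    obs = list(obs)
    n0 = (tuple(init_states), 0, ((),) * n_links)
    nodes, edges, unmarked = {n0}, set(), [n0]
    while unmarked:                                    # Steps 2, 3 and 5
        node = unmarked.pop()
        states, idx, queues = node
        for t in transitions:                          # Step 4
            i = t["comp"]
            if states[i] != t["src"]:
                continue
            events, in_link = t["trig"]                # in_link None: standard input
            if in_link is not None and (not queues[in_link]
                                        or queues[in_link][0] not in events):
                continue                               # triggering event not available
            if t["msg"] is not None and obs[idx:idx + 1] != [t["msg"]]:
                continue                               # inconsistent with the observation
            base = list(queues)
            if in_link is not None:
                base[in_link] = base[in_link][1:]      # Step 4(d)
            # One successor per combination of values of the uncertain outputs.
            for choice in product(*(evs for evs, _ in t["outs"])):
                qs = list(base)
                for e, (_, link) in zip(choice, t["outs"]):
                    if e is not None:                  # None stands for the null event
                        qs[link] = qs[link] + (e,)     # Step 4(e)
                if any(len(q) > cap for q in qs):
                    continue                           # WAIT: link saturated
                succ = (states[:i] + (t["dst"],) + states[i + 1:],
                        idx + (t["msg"] is not None), tuple(qs))
                if succ not in nodes:                  # Step 4(f)
                    nodes.add(succ)
                    unmarked.append(succ)
                edges.add((node, t["name"], succ))     # Step 4(g)
    # Step 6: keep only what lies on a path to a final node.
    keep = {n for n in nodes if n[1] == len(obs) and not any(n[2])}
    changed = True
    while changed:
        changed = False
        for a, _, b in edges:
            if b in keep and a not in keep:
                keep.add(a)
                changed = True
    return keep, {(a, t, b) for a, t, b in edges if a in keep and b in keep}

def diagnoses(n0, edges, obs_len, faulty):
    """Sets of faulty components, one per class of candidate histories."""
    start = (n0, frozenset())
    seen, stack, result = {start}, [start], set()
    while stack:
        node, faults = stack.pop()
        if node[1] == obs_len and not any(node[2]):
            result.add(faults)
        for a, name, b in edges:
            if a == node:
                nxt = (b, faults | {faulty[name]} if name in faulty else faults)
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return result

# Hypothetical toy system: producer P (component 0), consumer C (component 1), one link.
TRANSITIONS = [
    {"name": "p1", "comp": 0, "src": "P0", "dst": "P1",
     "trig": (("go",), None), "outs": [(("e",), 0)], "msg": "a"},
    {"name": "c1", "comp": 1, "src": "C0", "dst": "C1",
     "trig": (("e",), 0), "outs": [], "msg": "b"},
    {"name": "c2", "comp": 1, "src": "C0", "dst": "C0",   # faulty: swallows e silently
     "trig": (("e",), 0), "outs": [], "msg": None},
]
FAULTY = {"c2": "C"}
N0 = (("P0", "C0"), 0, ((),))

for observation in (["a", "b"], ["a"]):
    _, kept_edges = reconstruct(TRANSITIONS, 1, observation, ["P0", "C0"])
    print(observation, diagnoses(N0, kept_edges, len(observation), FAULTY))
```

In the toy system, the observation `["a", "b"]` is explained without any fault, while `["a"]` alone can only be explained by the faulty transition of C. In the first case the node reached through `c2` is a blind alley and is removed by the pruning step, which is exactly the kind of wasted exploration that prospection knowledge is meant to avoid.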

Example 2. Shown in Figure 2 is the reconstruction space generated
short-sightedly for the diagnostic problem $\wp(\Psi) = (OBS(\Psi), \Psi_0)$,
where $\Psi$ is the system outlined in Figure 1, $OBS(\Psi) = \langle a, b, c, d \rangle$,
and $\Psi_0 = (X_1, Y_1)$. Each node is depicted by an ellipse, wherein

•  $\sigma = (X_i, Y_j)$ is the pair of component states;

•  $\Im$ is the prefix of the observation generated so far;

•  $Q = (Q_1, Q_2)$ is the pair of link queues.

Edges are marked by the corresponding component transitions,
possibly qualified by the relevant chosen label when the involved
output event is uncertain. Dotted edges denote faulty transitions. Final
nodes are depicted as double ellipses. The dashed part of the graph
corresponds to inconsistent states, which are almost half the search
space. Owing to cycles in the graph (edges marked by $x_2$), the
active space includes an unbounded number of candidate histories.
However, only two candidate diagnoses are possible, namely $\{Y\}$
and $\{X, Y\}$. Note that, although not relevant to our example, the
replication of the same faulty transition in a cycle does not change
the diagnosis. A finer-grained diagnosis can be defined, as in [2],
called deep diagnosis. The latter is a set of pairs $(C, f)$, where $C$ is
a component and $f$ a fault event. This way, even if not relevant to
our example where each component model includes a single faulty
transition, it is possible to know all the faulty transitions performed
by each misbehaving component. □

4  FAR-SIGHTED  DIAGNOSIS 

The essential problem with short-sighted diagnosis lies in the lack
of any prospection in the search space as to the consistency of the
link queues. In other words, the inability to understand that a given
configuration of $Q$ is bound to a ‘blind alley’ forces the
reconstruction algorithm to uselessly explore possibly large parts of the search
space. In order to overcome this limitation, prospection knowledge
can be automatically generated off-line based on the system model.
Considering Figure 2, such knowledge will allow the
reconstruction process to avoid entering the inconsistent sub-space through
$y_2$.

The basic idea is to view a link $L$ as a buffer in which a producer
component $C_p$ generates events that are consumed by a consumer
component $C_c$. That is, $L$ connects an output terminal of $C_p$ to an
input terminal of $C_c$. The way events are produced and consumed
in $L$ is constrained both by the characteristics of the link (capacity
and saturation policy) and by the models of $C_p$ and $C_c$.

4.1  Prospection  graphs 

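Let $L = (I, O, \chi, P)$ be a link from output terminal $O_p$ of
component $C_p$ to input terminal $I_c$ of component $C_c$, with queue
domain $\mathbb{Q}$. Let $M_p = (\mathbb{S}_p, \mathbb{E}_p^{in}, \mathbb{I}_p, \mathbb{E}_p^{out}, \mathbb{O}_p, \mathbb{T}_p)$ and
$M_c = (\mathbb{S}_c, \mathbb{E}_c^{in}, \mathbb{I}_c, \mathbb{E}_c^{out}, \mathbb{O}_c, \mathbb{T}_c)$ be the models of $C_p$ and $C_c$,
respectively. Let

$$M_p^n = (\mathbb{S}_p^n, \mathbb{E}_p^n, \mathbb{T}_p^n)$$

be the nondeterministic automaton obtained from $M_p$ in such a way
that

•  $\mathbb{S}_p^n = \mathbb{S}_p$ is the set of states;

•  $\mathbb{E}_p^n \subseteq \mathbb{T}_p \cup \{\epsilon\}$ is the set of events;

•  $\mathbb{T}_p^n : \mathbb{S}_p^n \times \mathbb{E}_p^n \mapsto 2^{\mathbb{S}_p^n}$ is the transition function.

The transition function $\mathbb{T}_p^n$ is obtained from $\mathbb{T}_p$ as follows:

$$\forall T = S \xrightarrow{\alpha \mid \beta} S' \in \mathbb{T}_p: \quad \begin{cases} S \xrightarrow{T} S' \in \mathbb{T}_p^n & \text{if } L \in Link(\beta) \\ S \xrightarrow{\epsilon} S' \in \mathbb{T}_p^n & \text{otherwise.} \end{cases}$$

Similarly, let

$$M_c^n = (\mathbb{S}_c^n, \mathbb{E}_c^n, \mathbb{T}_c^n)$$

be the nondeterministic automaton obtained from $M_c$ in such a way
that

•  $\mathbb{S}_c^n = \mathbb{S}_c$ is the set of states;

•  $\mathbb{E}_c^n \subseteq \mathbb{T}_c \cup \{\epsilon\}$ is the set of events;

•  $\mathbb{T}_c^n : \mathbb{S}_c^n \times \mathbb{E}_c^n \mapsto 2^{\mathbb{S}_c^n}$ is the transition function.

The transition function $\mathbb{T}_c^n$ is obtained from $\mathbb{T}_c$ as follows:

$$\forall T = S \xrightarrow{\alpha \mid \beta} S' \in \mathbb{T}_c: \quad \begin{cases} S \xrightarrow{T} S' \in \mathbb{T}_c^n & \text{if } L = Link(\alpha) \\ S \xrightarrow{\epsilon} S' \in \mathbb{T}_c^n & \text{otherwise.} \end{cases}$$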

Let $\mathcal{M}_p = (\mathcal{S}_p, \mathcal{E}_p, \mathcal{T}_p)$ and $\mathcal{M}_c = (\mathcal{S}_c, \mathcal{E}_c, \mathcal{T}_c)$ be the
deterministic automata equivalent to $M_p^n$ and $M_c^n$, respectively. A
prospection state $\mathcal{L}$ of $L$ is a triple

$$\mathcal{L} = (S_p, S_c, Q) \in \mathcal{S}_p \times \mathcal{S}_c \times \mathbb{Q}.$$

Let $\mathcal{L}$ be a prospection state and $T = S \xrightarrow{\alpha \mid \beta} S' \in (\mathcal{T}_p \cup \mathcal{T}_c)$,
$S \in \{S_p, S_c\}$. Let $Q$ be a queue of events in $L$ and

•  $Head(Q)$ denote the first consumable event in $Q$;

•  $Tail(Q)$ denote the sequence of events in $Q$ following the first
event;

•  $App(Q, e)$ denote the queue obtained by appending $e$ to $Q$;

•  $Repl(Q, e)$ denote the queue obtained by replacing the last event
in $Q$ with $e$.

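The Next function yields the set of next prospection states as
follows:

$$Next(\mathcal{L}, T) \stackrel{\text{def}}{=} \{\mathcal{L}' \mid \mathcal{L}' \in Next_p(\mathcal{L}, T), T \in \mathcal{T}_p\} \cup \{\mathcal{L}' \mid \mathcal{L}' \in Next_c(\mathcal{L}, T), T \in \mathcal{T}_c\}$$

where

$$Next_p(\mathcal{L}, T) \stackrel{\text{def}}{=} \{\mathcal{L}' \mid \mathcal{L}' = (S', S_c, Q'),\ (E, O_p) \in \beta,\ e \in E,\ Q' = Ins(Q, e),\ (|Q| < \chi \text{ or } (|Q| = \chi,\ (e = \epsilon \text{ or } P \in \{LOSE, OVERRIDE\})))\},$$

$$Ins(Q, e) = \begin{cases} App(Q, e) & \text{if } |Q| < \chi \\ Q & \text{if } |Q| = \chi,\ (e = \epsilon \text{ or } P = LOSE) \\ Repl(Q, e) & \text{if } |Q| = \chi,\ P = OVERRIDE \end{cases}$$

and

$$Next_c(\mathcal{L}, T) \stackrel{\text{def}}{=} \{\mathcal{L}' \mid \mathcal{L}' = (S_p, S', Q'),\ \alpha = (E, I_c),\ e \in E,\ Head(Q) = e,\ Q' = Tail(Q)\}.$$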

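Let $C_0 = (S_{p_0}, S_{c_0})$ be the pair of initial states for $C_p$ and $C_c$, respectively. The spurious prospection graph of $L$ and $C_0$ is the nondeterministic automaton

$\tilde{\Gamma}^n(L, C_0) = (\tilde{S}^n, E^n, \tilde{T}^n, S_0, \tilde{S}_f)$

where

$\tilde{S}^n = \{\mathcal{L} \mid \mathcal{L} \text{ is a prospection state of } L\}$ is the set of states,

$E^n \subseteq E_p \cup E_c \subseteq T_p \cup T_c$ is the set of events,

$S_0 = (S_{p_0}, S_{c_0}, \langle\rangle)$ is the initial state,

$\tilde{S}_f = \{\mathcal{L} \mid \mathcal{L} \in \tilde{S}^n, \mathcal{L} = (S_p, S_c, \langle\rangle)\}$ is the set of final states,

$\tilde{T}^n : \tilde{S}^n \times E^n \mapsto 2^{\tilde{S}^n}$ is the transition function defined as follows:

$\mathcal{L} \xrightarrow{T} \mathcal{L}' \in \tilde{T}^n$ iff $\mathcal{L}' \in \mathit{Next}(\mathcal{L}, T)$.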

A  state  of  a  spurious  prospection  graph  which  is  not  within  a  path 
from  the  initial  state  to  a  final  state  is  an  inconsistent  state.  Similarly, 
a  transition  entering  or  leaving  an  inconsistent  state  of  a  spurious 
prospection  graph  is  an  inconsistent  transition. 

The nondeterministic prospection graph is the nondeterministic automaton

$\Gamma^n(L, C_0) = (S^n, E^n, T^n, S_0, S_f)$

obtained from the spurious prospection graph $\tilde{\Gamma}^n(L, C_0)$ by removing inconsistent states and inconsistent transitions.

The prospection graph

$\Gamma(L, C_0) = (S, E, T, S_0, S_f)$

is the deterministic automaton equivalent to the nondeterministic prospection graph $\Gamma^n(L, C_0)$.
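Removing inconsistent states amounts to keeping only the states that lie on some path from the initial state to a final state. A minimal Python sketch of this pruning step is given below; the encoding of the graph as a list of `(source, label, target)` triples and the function names are assumptions made for illustration, not the paper's data structures.

```python
from collections import defaultdict, deque


def reachable(start_states, edges):
    """Breadth-first search: all states reachable from `start_states`."""
    seen = set(start_states)
    todo = deque(seen)
    while todo:
        state = todo.popleft()
        for nxt in edges[state]:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def prune(transitions, initial, finals):
    """Drop inconsistent states and transitions.

    transitions -- list of (source, label, target) triples
    Returns (consistent_states, consistent_transitions).
    """
    forward, backward = defaultdict(set), defaultdict(set)
    for src, _, dst in transitions:
        forward[src].add(dst)
        backward[dst].add(src)
    # A state is consistent iff it is reachable from the initial state
    # and some final state is reachable from it.
    consistent = reachable({initial}, forward) & reachable(set(finals), backward)
    kept = [(s, l, d) for (s, l, d) in transitions
            if s in consistent and d in consistent]
    return consistent, kept
```

A transition is kept only if both of its endpoints are consistent, which matches the definition of an inconsistent transition as one entering or leaving an inconsistent state.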


Figure 3. Generation of $\Gamma^n(L_1, (X_1, Y_1))$ (see Example 3).


Example 3. Shown in the dashed box of Figure 3 are the prospection models $M_p(X)$ (top) and $M_c(Y)$ (bottom), inherent to link $L_1$, which are relevant to the components $X$ and $Y$ displayed in Figure 1. Depicted on the top of the box is the nondeterministic automaton $M_p^n(X)$ equivalent to $M_p(X)$. The generation of the nondeterministic prospection graph $\Gamma^n(L_1, (X_1, Y_1))$ is outlined on the right of Figure 3, where double ellipses denote final states, while dashed nodes and edges represent inconsistent states and transitions, respectively. Note that the latter includes a circular path involving
four  states.  This  situation  is  similar  to  that  of  active  systems,  where 
cycles  may  stem  from  (possibly)  final  states.  Within  the  context  of 
prospection  graphs,  cycles  represent  repetitive  patterns  of  link  state 
changes  (in  our  example,  events  e3  and  e$  are  repeatedly  produced 
and  consumed,  that  is,  inserted  into  and  removed  from  link  Li).  □ 

Note  that,  essentially,  the  generation  of  a  prospection  graph  is  anal¬ 
ogous  to  the  generation  of  an  active  space,  where 

•  Component  models  are  substituted  by  prospection  models; 

•  Only  one  link  is  considered; 

•  No  observation  index  is  considered. 

4.1.1  Generalized  prospection  graphs 

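The notion of the prospection graph of a single link can be naturally extended to that of a set of links. Let $\mathbf{L} = \{L_1, \ldots, L_m\}$ be a set of links (with queue domains $Q_1, \ldots, Q_m$, respectively) connecting a set $C = \{C_1, \ldots, C_t\}$ of components, where each component $C_i$, $i \in [1 .. t]$, is characterized by model

$M_i = (S_i, E_{in_i}, I_i, E_{out_i}, O_i, T_i)$.

Let $M_i^n = (S_i^n, E_i^n, T_i^n)$ be the nondeterministic automaton obtained from $M_i$ in such a way that

• $S_i^n = S_i$ is the set of states;

• $E_i^n \subseteq T_i \cup \{\epsilon\}$ is the set of events;

• $T_i^n : S_i^n \times E_i^n \mapsto 2^{S_i^n}$ is the transition function.

The transition function $T_i^n$ is obtained from $T_i$ as follows:

$\forall T = S \xrightarrow{\alpha \mid \beta} S' \in T_i: \quad S \xrightarrow{T} S' \in T_i^n$ if $\mathit{Relevant}(\alpha, \beta, \mathbf{L})$, and $S \xrightarrow{\epsilon} S' \in T_i^n$ otherwise,

where

$\mathit{Relevant}(\alpha, \beta, \mathbf{L}) \stackrel{\mathrm{def}}{=} (\{\mathit{Link}(\alpha)\} \cup \mathit{Link}(\beta)) \cap \mathbf{L} \neq \emptyset$.

Let $M_i = (S_i, E_i, T_i)$ be the deterministic automaton equivalent to $M_i^n$. A generalized prospection state $\mathcal{L}$ of $\mathbf{L}$ is a pair

$\mathcal{L} = (\mathbf{S}, \mathbf{Q})$

where

$\mathbf{S} = (S_1, \ldots, S_t) \in (S_1 \times \cdots \times S_t)$,

$\mathbf{Q} = (Q_1, \ldots, Q_m) \in (Q_1 \times \cdots \times Q_m)$.

Let $\mathcal{L} = (\mathbf{S}, \mathbf{Q})$ be a generalized prospection state and

$S_i \xrightarrow{T} S' \in T_i$, $i \in [1 .. t]$, $T = S \xrightarrow{\alpha \mid \beta} S'$.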


Figure 4. Generation of the generalized prospection graph $\Gamma(\mathbf{L}, \Psi_0)$ (see Example 4).


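The generalized Next function yields the set of next generalized prospection states as follows:

$\mathit{Next}(\mathcal{L}, T) \stackrel{\mathrm{def}}{=} \{\mathcal{L}' \mid \mathcal{L}' = (\mathbf{S}', \mathbf{Q}'),\ \mathbf{S}' = (S'_1, \ldots, S'_t),\ \mathbf{Q}' = (Q'_1, \ldots, Q'_m),\ \alpha = (E_\alpha, I_\alpha),$

$((\mathit{Link}(I_\alpha) \notin \mathbf{L}) \text{ or } (\mathit{Link}(I_\alpha) = L_j,\ L_j \in \mathbf{L},\ e \in E_\alpha,\ \mathit{Head}(Q_j) = e,\ Q'_j = \mathit{Tail}(Q_j))),$

$\mathbf{L}_\beta = \{L_\beta \mid L_\beta = \mathit{Link}(O_\beta),\ (E_\beta, O_\beta) \in \beta,\ L_\beta \in \mathbf{L}\},$

$\forall L_h \in \mathbf{L}_\beta\ (e \in E_\beta,\ (E_\beta, O_\beta) \in \beta,\ L_h = \mathit{Link}(O_\beta),\ Q'_h = \mathit{Ins}(Q_h, e),\ (|Q_h| < \chi_h \text{ or } (|Q_h| = \chi_h,\ (e = \epsilon \text{ or } P_h \in \{\mathit{LOSE}, \mathit{OVERRIDE}\})))),$

$\forall L_k \in (\mathbf{L} - (\mathbf{L}_\beta \cup \{\mathit{Link}(I_\alpha)\}))\ (Q'_k = Q_k),$

$S'_i = S',\ \forall x \in [1 .. t],\ x \neq i\ (S'_x = S_x)\}$.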

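Let $C_0 = (S_{0_1}, \ldots, S_{0_t})$ be the record of initial states for components in $C$. The generalized spurious prospection graph of $\mathbf{L}$ and $C_0$ is the nondeterministic automaton

$\tilde{\Gamma}^n(\mathbf{L}, C_0) = (\tilde{S}^n, E^n, \tilde{T}^n, S_0, \tilde{S}_f)$

where

$\tilde{S}^n = \{\mathcal{L} \mid \mathcal{L} \text{ is a prospection state of } \mathbf{L}\}$ is the set of states,

$E^n \subseteq \bigcup_{i=1}^{t} E_i \subseteq \bigcup_{i=1}^{t} T_i$ is the set of events,

$S_0 = (C_0, (\langle\rangle \cdots \langle\rangle))$ is the initial state,

$\tilde{S}_f = \{\mathcal{L} \mid \mathcal{L} \in \tilde{S}^n, \mathcal{L} = (\mathbf{S}, (\langle\rangle \cdots \langle\rangle))\}$ is the set of final states,

$\tilde{T}^n : \tilde{S}^n \times E^n \mapsto 2^{\tilde{S}^n}$ is the transition function defined as follows:

$\mathcal{L} \xrightarrow{T} \mathcal{L}' \in \tilde{T}^n$ iff $\mathcal{L}' \in \mathit{Next}(\mathcal{L}, T)$.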


The generalized nondeterministic prospection graph is the nondeterministic automaton

$\Gamma^n(\mathbf{L}, C_0) = (S^n, E^n, T^n, S_0, S_f)$

obtained from the generalized spurious prospection graph $\tilde{\Gamma}^n(\mathbf{L}, C_0)$ by removing inconsistent states and inconsistent transitions.

The generalized prospection graph

$\Gamma(\mathbf{L}, C_0) = (S, E, T, S_0, S_f)$

is the deterministic automaton equivalent to $\Gamma^n(\mathbf{L}, C_0)$.

Example 4. Shown in Figure 4 is the generation of the generalized prospection graph $\Gamma(\mathbf{L}, \Psi_0)$ relevant to the links in system $\Psi$ (see Figure 1), where $\mathbf{L} = \{L_1, L_2\}$ and $\Psi_0 = (X_1, Y_1)$. Specifically, outlined on the left are the prospection models of components $X$ and $Y$, namely $M(X)$ and $M(Y)$. Shown in the center is the generation of the generalized nondeterministic prospection graph $\Gamma^n(\mathbf{L}, \Psi_0)$ (the dashed part of the graph denotes the inconsistent search space),
where  consistent  nodes  are  identified  by  labels  £0  ■  ■  ■  £6-  Finally, 
displayed on the right is the corresponding deterministic prospection graph $\Gamma(\mathbf{L}, \Psi_0)$. The latter is determined based on the subset
construction  algorithm  presented  in  [1],  which  identifies  each  node 
of  the  deterministic  automaton  by  means  of  a  subset  of  nodes  of 
the  nondeterministic  one,  specifically,  those  nodes  that  are  reach¬ 
able  through  the  same  marking  transition.  For  example,  since  there 
are  two  edges,  leaving  the  same  state  £6  in  the  nondeterministic 
automaton,  that  are  marked  by  the  same  label  *9,  the  deterministic 
automaton  will  include  the  node  identified  by  the  subset  {£3,  £7}, 
which  is  reached  from  {£g }  by  means  of  the  (unique)  edge  marked 
by  xg.  According  to  the  algorithm,  each  node  in  the  deterministic 
automaton  that  includes  a  final  state  of  the  nondeterministic  one  is 
final  itself.  Nodes  of  the  deterministic  automaton  are  identified  by 
labels  0  ■  •  •  8.  □ 
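The subset construction referred to in Example 4 can be sketched in a few lines of Python. This is the textbook construction for automata without $\epsilon$-transitions and is not claimed to be the exact algorithm of [1]; the triple encoding of transitions and all state and label names in the usage are hypothetical.

```python
from collections import defaultdict


def determinize(transitions, initial, finals):
    """Subset construction for a nondeterministic automaton.

    transitions -- list of (source, label, target) triples
    Returns (dstates, dtransitions, dfinals), where each deterministic
    state is a frozenset of nondeterministic states.
    """
    delta = defaultdict(lambda: defaultdict(set))
    for src, label, dst in transitions:
        delta[src][label].add(dst)

    start = frozenset([initial])
    dstates, dtransitions, todo = {start}, {}, [start]
    while todo:
        current = todo.pop()
        labels = {l for s in current for l in delta[s]}
        for label in labels:
            # All states reachable from `current` by edges with this label.
            target = frozenset(d for s in current
                               for d in delta[s].get(label, ()))
            dtransitions[(current, label)] = target
            if target not in dstates:
                dstates.add(target)
                todo.append(target)
    # A deterministic node is final if it contains a final state.
    dfinals = {d for d in dstates if d & set(finals)}
    return dstates, dtransitions, dfinals
```

If a state `"p"` has two outgoing edges with the same label `"t"` leading to `"q"` and `"r"`, the deterministic automaton contains the single node `{"q", "r"}`, reached from `{"p"}` through one edge labeled `"t"`.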


Given a system $\Sigma$, in order to exploit the prospection knowledge in the reconstruction process, we need to create a set of $g$ prospection graphs

$\Gamma(\Sigma) = \{\Gamma(\mathbf{L}_1, C_{0_1}), \ldots, \Gamma(\mathbf{L}_g, C_{0_g})\}$

such that $\bigcup_{i=1}^{g} \mathbf{L}_i$ equals the whole set of links in $\Sigma$. $\Gamma(\Sigma)$ is a prospection coverage of $\Sigma$.

Algorithm 2. (Far-sighted Reconstruction)

The far-sighted reconstruction algorithm is a variation of Algorithm 1. First, the $Q$ field of a node denotes a record of $g$ states relevant to the $g$ prospection graphs in the prospection coverage $\Gamma(\Sigma)$, namely

$Q = (\gamma_1, \ldots, \gamma_g)$.

Moreover, in the initial node $N_0 = (\sigma_0, \Im_0, Q_0)$, $Q_0$ is represented by the record of the initial states of the corresponding prospection graphs, namely $(\gamma_{0_1}, \ldots, \gamma_{0_g})$. Finally, Step 4 of Algorithm 1 is changed as follows:

For each $i$ in $[1 .. n]$, for each transition $T$ within the model of component $C_i$, if $T$ is triggerable, that is, if the following two conditions hold

(i) $T$ is consistent with $\mathit{OBS}(\Sigma)$;

Let $\Pi(T) = \{\Gamma_1, \ldots, \Gamma_r\}$ be the prospection graphs in $\Gamma(\Sigma)$ that are relevant to links connected with terminals on which events are either consumed or generated by $T$; let $\{\gamma_1, \ldots, \gamma_r\}$ be the elements of $Q(N)$ relevant to $\Pi(T)$;

(ii) $\forall j \in [1 .. r]$ ($\gamma_j \xrightarrow{T} \gamma'_j$ is an edge in $\Gamma_j$),

then do the following steps:

(a) Create a node $N' = (\sigma', \Im', Q') := N$;

(b) $\sigma'[i]$ := the state reached by $T$;

(c) If $T$ is observable, then $\Im' := \Im + 1$;

(d) Replace the elements of $Q'$ corresponding to $\gamma_1, \ldots, \gamma_r$ with the new prospection states $\gamma'_1, \ldots, \gamma'_r$;

(e) If $N' \notin \mathcal{N}$ then insert $N'$ into $\mathcal{N}$;

(f) Insert edge $N \xrightarrow{T} N'$ into $\mathcal{E}$.

Essentially, Algorithm 2 exploits the knowledge about the consistency of link states by means of the prospection graphs generated off-line, thereby preventing the search from entering (possibly large) inconsistent parts of the space. Of course, such a prospection is finite, so it does not completely eliminate backtracking. Besides, it allows for an efficient treatment of nondeterminism caused by uncertain events. Recall that, in short-sighted reconstruction, such situations can only be dealt with by mere enumeration of all possible new link states generated by the collection of output events of the current transition. For example, if $T$ generated 3 uncertain events (on three different links), each of which is represented by a disjunction of 2 values, then we would have $2^3 = 8$ new nodes. Instead, since the prospection graphs are deterministic, with far-sighted reconstruction only one new node is generated, as at most one edge marked by $T$ can leave each current state of the prospection graphs.
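The count of 8 in the example above is simply the size of a Cartesian product. The short Python sketch below enumerates the combinations that short-sighted reconstruction would have to consider; the event labels and the function name are made up for illustration.

```python
from itertools import product


def short_sighted_successors(uncertain_events):
    """Enumerate every combination of concrete values for the uncertain
    output events of one transition (one event per link)."""
    return list(product(*uncertain_events))


# Three uncertain events on three different links, two candidate values each
# (hypothetical labels).
events = [("a1", "a2"), ("b1", "b2"), ("c1", "c2")]
combinations = short_sighted_successors(events)
print(len(combinations))  # 2 * 2 * 2 combinations, hence 8 new nodes
```

In general the number of new nodes is the product of the sizes of the disjunctions, so it grows exponentially with the number of uncertain output events.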

Proposition 1. Let $\wp(\Sigma)$ be a diagnostic problem and $\|A\|$ denote the (possibly unbounded) set of histories incorporated in an active space $A$. Let $\mathit{Act}^{s}(\wp(\Sigma))$ and $\mathit{Act}^{f}(\wp(\Sigma))$ denote the active spaces generated by Algorithm 1 and Algorithm 2, respectively. Then,

$\|\mathit{Act}^{s}(\wp(\Sigma))\| = \|\mathit{Act}^{f}(\wp(\Sigma))\|$.


Figure 5. Far-sighted reconstruction space (see Example 5).


Example 5. Shown in Figure 5 is the reconstruction space for the diagnostic problem $\wp(\Psi) = (\langle a, b, c, d \rangle, (X_1, Y_1))$ based on the generalized prospection graph outlined on the right of Figure 4. The comparison with the short-sighted reconstruction (based on Algorithm 1) displayed in Figure 2 is striking. While the number of consistent states (15) is necessarily equal in both reconstructions, the far-sighted reconstruction space includes one inconsistent state only, against the 14 inconsistent states of the short-sighted reconstruction space. In fact, while the two states on top of both graphs are the same, there is a right branch stemming from the latter of such states in the short-sighted reconstruction which is missing in the far-sighted reconstruction. This branching is actually disabled by prospection graph $\Gamma(\{L_1, L_2\}, (X_1, Y_1))$, which constrains the occurrence of all the transitions involved in event exchange on the links of system $\Psi$: according to this prospection graph, only transition $y_1$ is allowed to follow $x_1$, while $y_2$, which is responsible for the blind alley in Figure 2, is not. □

5  CONCLUSION 

Referring to the active system approach [2, 3] to diagnosis of DESs, this paper has shown how the off-line compilation of knowledge about event exchange between components brings a computational advantage on-line in terms of reduction of the number of backtracking steps performed by the history reconstruction algorithm. This advantage is especially tangible when relaxing a strong assumption of all the state-of-the-art approaches to diagnosis of DESs, namely, the preciseness of events. In this work, all input and output events in behavioral models, and not only observable events as in [14], may have an imprecise value ranging over a set of labels, namely an uncertain value. In the presence of uncertain events, the search performed by short-sighted diagnosis is nondeterministic, while that carried out with the support of prospection knowledge is deterministic. Moreover, prospection graphs, once generated off-line, can be reused several times on-line for different diagnostic problems inherent to the same system, or even for the same diagnostic problem in case there are repetitive link patterns in the system structure.

A previous proposal [13], itself based on knowledge compilation, transforms the active system approach into a spectrum of approaches which, according to the classification in Section 1, range from a totally first category version, wherein an exhaustive simulation of the system evolution is performed off-line, while on-line activities are limited to rule-checking, to a totally second category version, i.e. the original approach wherein no computation is performed off-line. Each approach falling in between consists of both off-line and on-line processing. The contribution of this paper is orthogonal to that work, that is, it could be integrated within any version of the spectrum (with the exception of the exclusively on-line one) in order to reduce backtracking steps in any reconstruction.

The exchange of events among components dealt with in this paper, being both asynchronous and buffered, is peculiar only to the active system approach. One might argue that providing for a specific modeling primitive, namely the link, for the structural objects that implement asynchronous buffered communication between components, along with specific methods for dealing with them, just increases the expressive power of the method but does not alter its computational power at all. In fact, each link could be replaced by a common component, whose behavioral model represents the link behavior, and, therefore, synchronous composition of automata would suffice. This is correct in principle but scarcely feasible in practice, for many reasons. First, the size of the behavioral model of such a component depends not only on the capacity of the link buffer but also on the number of distinct kinds of events that can be transmitted on the link. For instance, let us consider a link with capacity equal to three, on which four kinds of events, say a, b, c, and d, can be transmitted. As each state of the component representing the link is univocally identified by the sequence of events in the buffer, the behavioral model of such a component has $\sum_{k=0}^{3} 4^k = 85$ states! So large a model is a burden for history reconstruction. In fact, the model may be unduly large as it includes even states that are physically impossible given the system structure, since they correspond to sequences of events that cannot be generated.
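The figure of 85 states can be checked by enumerating every possible buffer content. The Python sketch below does so for the link of the example; the function name and the string encoding of event kinds are assumptions made for illustration.

```python
from itertools import product


def buffer_states(event_kinds, capacity):
    """All sequences of at most `capacity` events, i.e. all possible
    contents of the link buffer (including the empty buffer)."""
    return [seq
            for length in range(capacity + 1)
            for seq in product(event_kinds, repeat=length)]


states = buffer_states("abcd", 3)
print(len(states))  # 1 + 4 + 16 + 64 = 85
```

With capacity $\chi$ and $n$ kinds of events the count is $\sum_{k=0}^{\chi} n^k$, which is why the size of such a component model grows so quickly.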

Besides,  as  remarked  above,  such  a  model  depends  on  the  kinds 
of  events  that  can  be  transmitted  on  the  link,  that  is,  it  depends  on 
the  producer  component  of  the  link  at  hand.  This  is  somewhat  in 
contrast  with  the  philosophy  of  compositional  modeling,  according 
to  which  individual  component  models  are  reciprocally  independent. 

Instead, in the active system approach and, consequently, in this paper, a link is just the instantiation of a model, encompassing only the terminals, capacity, and policy of the link, and such a model is independent of the structure of the system in which the link is instantiated. Of course, notwithstanding the modeling simplicity, link states are bound to emerge in the computation, sooner or later. The methods introduced in this paper are actually aimed at minimizing the number of physically impossible link states (and, hence, since a link state is a part of any active system state, the number of active space states) visited by the history reconstruction search algorithm. In short-sighted diagnosis, where a link state is represented as a sequence of events, not all sequences of events are considered but only those that can be generated given the system structure. In far-sighted diagnosis, where the state of one or several links becomes a record of indexes, the number of visited link states is further reduced: only those states are generated that can evolve towards a state wherein the link is empty.
Model-based Monitoring of Piecewise Continuous Behaviors using Dynamic Uncertainty Space Partitioning¹
Abstract. Monitoring is gaining importance for many technical systems such as robots, production lines or anti-lock brakes. A monitoring system for technical systems must be able to deal with incomplete knowledge of the supervised system, to process noisy observations and to react within predefined time windows. This paper presents a new approach to monitoring technical systems based on imprecise models. Our approach repeatedly partitions the uncertainty space of an imprecise model and checks the derived model’s state for consistency with the measurements. Inconsistent partitions are then refuted, resulting in a smaller uncertainty space and a faster failure detection. This paper further focuses on the extension of our basic approach to monitoring systems that exhibit both continuous and discrete behaviors. Our monitoring system has been implemented using COTS components and has been demonstrated in online monitoring of a non-trivial heating system.

Keywords:  fault  detection;  hybrid  systems;  imprecise  models; 
residual  generation 

1  INTRODUCTION 

The primary objective of a monitoring system is to detect abnormal behaviors of a supervised system as soon as possible to avoid shutdown or damage. Technical systems such as robots, production lines or anti-lock brakes pose a vast number of challenges for a monitoring system, i.e., it must be able to deal with incomplete knowledge about the supervised system, to process noisy observations and to react within predefined time windows.

A particularly important and widely applied approach is model-based monitoring [6, 5], which relies on a comparison of the predicted behavior of a model with the observed behavior of the supervised system. Our approach using dynamic uncertainty space partitioning [12] is based on imprecise models where the structure of the models is known and the parameters may be imprecisely given as numeric intervals. These parameter intervals span the uncertainty space of the model. From an imprecise model based on intervals, only bounds on the trajectory (envelopes) can be derived. Dynamic uncertainty space partitioning keeps the envelopes small by exploiting the measurements from the supervised system as soon as possible. Whenever new measurements arrive, residuals are generated at the “corner points” of the uncertainty space and checked for consistency by comparing their signs. This results in a fast fault detection [12].

1 This work has been supported by the Austrian Science Fund under grant number P14233-INF. The authors are in alphabetical order.

2 Institute for Technical Informatics, Graz University of Technology, AUSTRIA; email: [rinner, uweiss]@iti.tu-graz.ac.at.

The fundamental assumption of dynamic uncertainty space partitioning is that the model’s state values are monotonic within the range of the uncertainty space. Discontinuous transitions in the system’s model may introduce non-monotonic behaviors in the state values and, therefore, violate our assumption for the consistency check. In order to preserve a conservative monitoring approach for hybrid systems, we have to extend our consistency check by a monotonicity check. Whenever the monotonicity of the state values is given, the consistency check can be performed, potentially resulting in a refutation of the imprecise model. If the monotonicity is not known, the consistency check is simply skipped and no model is refuted.

The remainder of this paper is organized as follows. Section 2 describes the technical details of uncertainty space partitioning and the consistency check. Section 3 discusses the necessary extensions of our approach to monitoring systems which exhibit both continuous and discrete behaviors. Section 4 presents experimental results of our monitoring approach in a real-world system with several changes of an input value. A discussion and a summary of related work conclude this paper.

2  MONITORING  BASED  ON  UNCERTAINTY 
SPACE  PARTITIONING 

2.1  Overview 

Monitoring methods based on imprecise models can reason with incomplete knowledge in the model as well as with noisy measurements. A main drawback of this approach, however, is that the envelopes may diverge very rapidly, which delays or even inhibits fault recognition. We have revised this interval approach to model-based monitoring with the primary goal of keeping the resulting envelopes as small as possible.

In our approach, we exploit the measurements from the supervised system as soon as possible to refine the uncertainty in the model and the derived envelopes. The key step in our approach is to partition the uncertainty space of the model into several subspaces. The trajectories derived from each subspace are then checked for consistency with the measurements. Each inconsistent subspace is refuted and excluded from further investigations. Partitioning and consistency checking are continued, resulting in a smaller uncertainty space of the model. When all subspaces are refuted, a discrepancy between model prediction and observation has been recognized and a fault has been detected.

2.2  Subspace  Partitioning  and  Consistency 
Checking 

In general, a technical system can be modeled as

$x_t = f(x_{t-1}, u_{t-1}, p_{t-1})$
$y_t = g(x_t, p_t)$ (1)

where $x_t$ is the state vector at discrete time $t$, $u_t$ is the input vector at time $t$, $p_t$ is the parameter vector at time $t$, $y_t$ is the output vector at time $t$, and $g$ and $f$ are vector functions. In an exact model, $p_t$ is a vector of real numbers. However, in a model with uncertain parameters, $p_t$ is replaced by a vector of intervals $\tilde{p}_t = [(\underline{p}_{1,t}, \overline{p}_{1,t}), (\underline{p}_{2,t}, \overline{p}_{2,t}), \ldots, (\underline{p}_{K,t}, \overline{p}_{K,t})]^T$, where $K$ is the number of uncertain parameters. A model with uncertain parameters, i.e., an imprecise model, can therefore be described as:

$\tilde{x}_t = f(\tilde{x}_{t-1}, u_{t-1}, \tilde{p}_{t-1})$
$\tilde{y}_t = g(\tilde{x}_t, \tilde{p}_t)$ (2)

Equation 2 is the starting point of our approach. It defines an imprecise model of the supervised system with $K$ uncertain parameters. Thus, this model has a $K$-dimensional uncertainty space. In order to divide this uncertainty space we have to define a partition

$\tilde{q}_t = [(\underline{q}_{1,t}, \overline{q}_{1,t}), (\underline{q}_{2,t}, \overline{q}_{2,t}), \ldots, (\underline{q}_{K,t}, \overline{q}_{K,t})]^T$ with $\tilde{q}_t \subseteq \tilde{p}_t$.

A complete partitioning of the uncertainty space at any time $t$ into $M$ partitions must satisfy the condition $\bigcup_{m} \tilde{q}_t^{(m)} = \tilde{p}_t$, where $m = 1, \ldots, M$. A model based on a partition of the uncertainty space is referred to as a subspace model. From the definition of a partition, we can finally define the state of a subspace model as:

$\tilde{x}_t^{(m)} = f(\tilde{x}_{t-1}^{(m)}, u_{t-1}, \tilde{q}_{t-1}^{(m)})$
$\tilde{y}_t^{(m)} = g(\tilde{x}_t^{(m)}, \tilde{q}_t^{(m)})$. (3)


With the monotonicity assumption of $f$ and $g$ with regard to the parameters $p_t$ over the range of the intervals, the (uncertain) state of a subspace model can be represented by the (exact) states at the corner points of a subspace. The corner points of a subspace are defined as all combinations of upper and lower bounds of a partition $\tilde{q}_t$ and can be represented as the set $\{q_{t,i}^{(m)}\}$ with $i = 1, \ldots, 2^K$. Thus, an uncertainty space of dimension $K$ results in $2^K$ corner points. The states at the corner points can be represented as the set

$\{x_{t,i}^{(m)}\} = \{f(x_{t-1,i}^{(m)}, u_{t-1}, q_{t-1,i}^{(m)})\}$
$\{y_{t,i}^{(m)}\} = \{g(x_{t,i}^{(m)}, q_{t,i}^{(m)})\}$ (4)

where $q_{t,i}^{(m)}$ is an exact parameter vector at time $t$ from the subspace $m$ and at corner $i = 1, \ldots, 2^K$ of this subspace. Note that $x_{t,i}^{(m)}$ are state vectors, and $y_{t,i}^{(m)}$ are output vectors, with exact values. Note also that this approach assumes that the parameters of the system are constant and do not vary in time. This assumption will be discussed later.

This representation of an uncertain state is directly exploited by our consistency check for a given subspace $m$. First, a residual is calculated for each state at a corner point using the measurements at time $t$, i.e., $r_{t,i}^{(m)} = y_{t,measured} - y_{t,i}^{(m)}$, where $r_{t,i}^{(m)}$ has the same dimension $J$ as $y_{t,measured}$ and $y_{t,i}^{(m)}$. Then, the minimum and maximum values of the residual are determined as

$r_{t,\min,j}^{(m)} = \min_i \{r_{t,i,j}^{(m)}\}$ (5)

$r_{t,\max,j}^{(m)} = \max_i \{r_{t,i,j}^{(m)}\}$ (6)

with $i = 1, \ldots, 2^K$ and $j = 1, \ldots, J$. Finally, subspace model $m$ is checked for consistency simply by comparing the signs of $r_{t,\min,j}^{(m)}$ and $r_{t,\max,j}^{(m)}$. The subspace model $m$ is consistent with the measurements iff

$\mathrm{sgn}(r_{t,\min,j}^{(m)}) \neq \mathrm{sgn}(r_{t,\max,j}^{(m)})$ (7)

holds for all elements $j = 1, \ldots, J$.

Figure 1. Consistency check with one uncertain parameter $p$ and three subspaces $q_1$, $q_2$, and $q_3$. The residuals at the corner points of subspace $q_1$ are both negative; therefore, the model with the subspace $q_1$ is inconsistent with the measurement. In subspace $q_2$, the residuals at the corner points have different signs. Thus, $q_2$ is consistent. For the parameter range of subspace $q_3$ the monotonicity assumption is violated. In this case, checking the residuals’ signs at the corner points is not feasible.

Informally, Equation 7 checks whether the zero vector lies within the “residual subspace” (see Figure 1). If this equation is violated, the subspace model $m$ is refuted. This simple consistency check also holds if not all elements of $y$ are included in the measurements. In this case, the comparison for the missing elements is simply skipped. Since this technique is based on the calculation of an exact state (at corner points), we can use standard numerical methods for computing the solution of differential equations. Note that subspaces are only refuted when they are genuinely inconsistent with the measurements.
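A minimal Python sketch of this consistency check is given below, assuming the outputs at all corner points have already been computed. The function name, the list-of-tuples encoding, the use of `None` for a missing measurement, and the treatment of a residual that is exactly zero as consistent are assumptions made for illustration; the sign comparison of Equation 7 is written as the equivalent test that zero lies between the minimum and maximum residual.

```python
def is_consistent(corner_outputs, measured):
    """Sign-based consistency check for one subspace model.

    corner_outputs -- list of output vectors, one per corner point
    measured       -- measured output vector; None marks a missing element
    """
    for j, y_meas in enumerate(measured):
        if y_meas is None:
            continue  # missing measurements are skipped
        residuals = [y_meas - y[j] for y in corner_outputs]
        # Zero must lie between the extreme residuals for element j.
        if not (min(residuals) <= 0.0 <= max(residuals)):
            return False  # both residuals have the same sign: refute
    return True


# Hypothetical one-parameter model y = p * u with p in [1, 3] and u = 2,
# so the outputs at the two corner points are 2 and 6.
corners = [(2.0,), (6.0,)]
print(is_consistent(corners, (5.0,)))  # True: residuals 3 and -1
print(is_consistent(corners, (7.0,)))  # False: residuals 5 and 1
```

Because only exact corner states are needed, the outputs passed in can come from any standard numerical integrator.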

Due to the uncertainty in the parameters, this method may result in diverging envelopes. This deviation of the predicted value from the “correct” value over time is referred to as accumulation uncertainty. In order to keep this deviation small, we have also introduced a dynamic partitioning of the subspace models. During monitoring, consistent subspaces are further partitioned, resulting in smaller subspace models that potentially describe the supervised system more precisely [12].

3  MONITORING  PIECEWISE  CONTINUOUS 
BEHAVIORS 

3.1  Monotonicity  at  Transitions 

In order to extend our approach to monitoring piecewise continuous behaviors and discrete transitions, we must take a closer look at our monotonicity assumption. Remember that the result of our consistency check is only valid if the state values within the subspace are monotonic.

In general, the monotonicity of the state values with regard to the parameters is not guaranteed by the monotonicity of the system equations $f$ and $g$. The monotonicity is only given when the following assumptions also hold:

1. the system input $u$ does not change, and

2. the initial values of a subspace model are the same over its complete uncertainty space.

Both assumptions are important for monitoring discrete and continuous behaviors. The first assumption is especially relevant for transitions because they are often triggered by stepwise changes of the system input (e.g., caused by operator actions). Such transitions therefore violate the first assumption. The second assumption is a simple consequence of the integration of the given differential equation:

$x_t = x_{t_0} + \int_{t_0}^{t} \dot{x}\, d\tau$ (8)

If the initial states $x_{t_0}$ are different at some corners in the subspace model, the state values $x_t$ may not be monotonic (even if $\dot{x}$ is monotonic). However, monotonicity is guaranteed after some time.

As discussed above, discontinuous transitions may result in a non-monotonicity of the state values with regard to the parameters (for a limited period of time), which in turn leads to an incorrect consistency check. Thus, to maintain a correct (and conservative) monitoring technique we must extend the consistency check by a check for monotonicity. If the monotonicity is not guaranteed, the consistency check is simply skipped and this subspace cannot be refuted. At some time after the transition the subspace may become monotonic again and the consistency check can be applied again.


Figure 2. Monotonicity check with one state value and one parameter. To check the subspace model for monotonicity, the gradients of the state values with regard to the parameters are calculated at the corner points. In this example, the subspace $q_1$ is monotonic and the subspace $q_2$ violates the monotonicity check.

3.2  Checking  for  Monotonicity

The monotonicity of the state values for an individual subspace is checked by the following method.

We define a matrix $B(t, x, p)$ with the elements

$$b_{ij}(t, x, p) = \frac{\partial \dot{x}_i(t, x, p)}{\partial p_j} \tag{9}$$

where $t$ is the time, $x$ the state vector, and $p$ the parameter vector with its elements $p_j$. We also define the matrix $C(t, x, p)$ with the elements

$$c_{ij}(t, x, p) = \frac{\partial x_i(t, x, p)}{\partial p_j} \tag{10}$$

The matrix $C(t, x, p)$ is calculated by

$$\frac{dC(t, x, p)}{dt} = A(t, x, p)\, C(t, x, p) + B(t, x, p), \tag{11}$$

where $C(0, x_0, p) = 0$ (the zero matrix), and the matrix $A(t, x, p)$ is defined as

$$a_{ij}(t, x, p) = \frac{\partial \dot{x}_i(t, x, p)}{\partial x_j} \tag{12}$$

The elements $c_{ij}(t, x, p)$ give us the trend of the state value $x_i(t, p)$ with regard to the parameter $p_j$. This is exploited by our monotonicity check. The state values of a subspace model are monotonic iff

$$\operatorname{sgn}(c_{ij,\min}) = \operatorname{sgn}(c_{ij,\max}) \tag{13}$$

holds for all state values $i = 1, \ldots, I$ and all directions of the uncertainty space $j = 1, \ldots, K$. $c_{ij,\min}$ are the values of $c_{ij}(t, x, p)$ at the corner min, and $c_{ij,\max}$ are the values of $c_{ij}(t, x, p)$ at the corner max of that subspace model (as described with Equations 5 and 6).

Figure 2 depicts the monotonicity check. In general, the information at the corner points is not sufficient to decide on monotonicity. However, assuming the monotonicity of the functions $f$ and $g$ with regard to the parameter, the monotonicity check becomes sufficient.

The calculation of the monotonicity check implies a numerical solution of the differential equation (Equation 11). However, since we also use a differential description of the system ($f = \dot{x}$), the monotonicity check does not significantly increase the computational load. Note that matrix $A$ is constant for linear systems.


4  THE  MONOTONICITY  CHECK  IN  A
REAL-WORLD  SYSTEM

We now examine the monotonicity behavior on a “real” technical system which comprises three heating/cooling components mounted on a thermally conductive plate. A process control computer (B&R 2003) controls the three heating/cooling components. The measured samples as well as the control actions issued are transferred to the monitoring system via an RS-232 interface.

Our model, which includes the three components with heating elements, is given as

$$\begin{aligned}
\dot{T}_1 &= \frac{1}{C_1}\bigl(q_{i1} - L_1(T_1 - T_0) - L_{12}(T_1 - T_2)\bigr) \\
\dot{T}_2 &= \frac{1}{C_2}\bigl(q_{i2} + L_{12}(T_1 - T_2) - L_2(T_2 - T_0) - L_{23}(T_2 - T_3)\bigr) \\
\dot{T}_3 &= \frac{1}{C_3}\bigl(q_{i3} + L_{23}(T_2 - T_3) - L_3(T_3 - T_0)\bigr)
\end{aligned} \tag{14}$$

where $T_i$ is the temperature of component $i$, $C_i$ is the mass of the components, $q_{ii}$ is the heat flow into the components, $L_i$ the thermal conductivity between component $i$ and the environment, $L_{ij}$ the thermal conductivity between the components $i$ and $j$, and $T_0$ the temperature of the environment. We can reduce the complexity of this model by exploiting the symmetric construction of the heating system ($L_3 = L_1$, $L_{23} = L_{12}$, $C_3 = C_1$), resulting in a total of five uncertain parameters.

The state vector is given as $x = (T_1, T_2, T_3)^T$, the input vector as $u = (q_{i1}, q_{i2}, q_{i3}, T_0)^T$, and the output vector as $y = (T_1 + n_1, T_2 + n_2, T_3 + n_3)^T$, where $n_i$ is the noise of each temperature sensor. The noise parameters are also included in the uncertainty space, resulting in a total of eight uncertain parameters. Note that noise parameters are not dynamically partitioned into smaller intervals, and they are not considered by the monotonicity check.

We have measured the input values as $q_{off} = 1.24\,\mathrm{W}$ and $q_{on} = 34.8\,\mathrm{W}$ (the heating element is either turned off or turned on). With an initial refinement step, we get the parameter intervals $L_1 = [0.12, 0.13]$, $L_2 = [0.15, 0.18]$, $L_{12} = [0.62, 0.73]$, $C_1 = [51, 54]$, $C_2 = [61, 65]$. The refinement step is performed in a single continuous behavior segment [12].
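
To make the model concrete, the following sketch integrates the symmetric three-component thermal model with a simple explicit Euler step. It is an illustration only: the interval midpoints used as parameter values, the ambient temperature of 20, the step size of 1 second and the function names are assumptions, and the paper itself uses a Runge-Kutta method rather than explicit Euler.

```python
def thermal_rhs(T, q, T0, L1, L2, L12, C1, C2):
    """Right-hand side of the reduced model (L3 = L1, L23 = L12, C3 = C1)."""
    T1, T2, T3 = T
    q1, q2, q3 = q
    dT1 = (q1 - L1 * (T1 - T0) - L12 * (T1 - T2)) / C1
    dT2 = (q2 + L12 * (T1 - T2) - L2 * (T2 - T0) - L12 * (T2 - T3)) / C2
    dT3 = (q3 + L12 * (T2 - T3) - L1 * (T3 - T0)) / C1
    return (dT1, dT2, dT3)


def simulate(T_init, q, T0, params, h, steps):
    """Explicit Euler integration over `steps` steps of size h."""
    T = tuple(T_init)
    for _ in range(steps):
        dT = thermal_rhs(T, q, T0, **params)
        T = tuple(Ti + h * dTi for Ti, dTi in zip(T, dT))
    return T


# Assumption: midpoints of the parameter intervals quoted in the text.
MIDPOINTS = dict(L1=0.125, L2=0.165, L12=0.675, C1=52.5, C2=63.0)

# Heating element 2 turned on, elements 1 and 3 turned off.
T_end = simulate((20.0, 20.0, 20.0), (1.24, 34.8, 1.24), 20.0,
                 MIDPOINTS, h=1.0, steps=100)
```

In the monitoring approach one such simulation would be run per corner of each subspace model instead of at a single parameter point.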

Figure 3. Measurements from the heating system used for monotonicity checking. The input $H_2$ is generated by the process control computer and sent to the monitoring system.

To examine the non-monotonic behavior in the system, we observe the system after a transition and count the subspace models which are marked as non-monotonic. Over time, this gives us a picture of how the transitions produce non-monotonicity in the state values. We choose the following scenario:

Control state 1: Heat $T_2$ until $T_2$ reaches 70. Then go to state 2.
Control state 2: Heat $T_2$ if $T_2 < 70$. If $t_{state2} > 100$ sec, go to state 3.
Control state 3: Heat $T_2$ if $T_2 < 90$. If $t_{state3} > 100$ sec $\wedge$ $T_2 > 90$, go to state 4.
Control state 4: Do not heat. If $T_2 < 50$, go to state 1.
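
The scenario above can be read as a small state machine. The sketch below is one possible encoding, assuming that the controller is evaluated once per sample, that `t_in_state` is the time in seconds spent in the current control state, and that heating is switched off in the evaluation in which state 1 hands over to state 2; the function name and signature are hypothetical.

```python
def controller_step(state, T2, t_in_state):
    """Evaluate the four-state heating controller once.

    Returns (heat, next_state), where heat is the heating flag for
    component 2 and next_state is the control state for the next sample.
    """
    if state == 1:
        # Heat until T2 reaches 70, then hand over to state 2.
        if T2 >= 70:
            return False, 2
        return True, 1
    if state == 2:
        # Hold T2 at 70; leave after more than 100 seconds.
        return T2 < 70, (3 if t_in_state > 100 else 2)
    if state == 3:
        # Hold T2 at 90; leave after more than 100 seconds once T2 > 90.
        leave = t_in_state > 100 and T2 > 90
        return T2 < 90, (4 if leave else 3)
    if state == 4:
        # Cool down; restart the cycle when T2 drops below 50.
        return False, (1 if T2 < 50 else 4)
    raise ValueError("unknown control state: %r" % (state,))
```

Each change of the returned heating flag is a stepwise change of an input, which is exactly the kind of transition that can make subspace models temporarily non-monotonic.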

Figure 3 plots the resulting measurements for this scenario. The heating flag $H_2$ (generated by the PCC) is used to obtain a discrete change of an input. To implement the heating element characteristic, we assume an additional mass $C_h$ and a thermal conductivity $L_{2h}$ between component 2 and the heating mass:

$$\dot{T}_{h2} = \frac{1}{C_h}\bigl(32.82\, H_2 - L_{2h}(T_{h2} - T_2)\bigr) \tag{15}$$

$$q_{i2} = 1.24 + L_{2h}(T_{h2} - T_2) \tag{16}$$

To demonstrate the non-monotonic effect after a stepwise change of an input, we check the monotonicity of all subspace models and count the non-monotonic ones, i.e., those which violate Equation 13. Figure 4 shows a part of the scenario where the temperature of component 2 is held at 90 degrees (control state 3). For this plot, we started with 128 subspace models, and no dynamic partitioning was introduced. Due to the discrete controller, the heating is turned on and off several times. At each transition about 40 subspace models are non-monotonic. An interesting observation in this figure is that the non-monotonic subspaces disappear quickly if the heating flag is turned off only for a short time.

Figure 5 shows the number of non-monotonic subspace models after control state 3. The peak here is about 30 subspace models. It shows that non-monotonic subspace models also exist for a “longer” time period (here about 400 seconds) after the last discrete change of an input.

Non-monotonic subspace models are not refuted and, therefore, do not contribute to decreasing the uncertainty space. Although the number of non-monotonic subspace models is at times quite high (about 50 percent of the current subspace models), this does not significantly influence the refutation. The reason is that such peaks do not last long, so the consistency check soon becomes valid again. In this example the number of consistent subspace models at the end of the scenario is about 20.


Figure 5. The non-monotonicity after the switching period. Shown are (as in Figure 4) the measurement and the envelopes of $T_2$, the heating flag $H_2$, and the number NM of non-monotonic subspace models. Some subspace models are non-monotonic after the heating period.

5  DISCUSSION 

In  this  paper,  we  have  presented  a  model-based  monitoring  approach 
based  on  uncertainty  space  partitioning.  The  fundamental  assump¬ 
tion  of  this  approach  is  the  monotonicity  of  the  state  values  with  re¬ 
gard  to  the  range  of  the  parameters.  In  systems  which  exhibit  both 
discrete  and  continuous  behaviors  the  monotonicity  can  not  be  guar¬ 
anteed  only  by  the  monotonicity  of  the  vector  functions.  Thus,  in  or¬ 
der  to  apply  our  basic  approach  to  monitor  hybrid  systems,  we  have 
introduced  a  monotonicity  check  for  the  state  values. 

Note the difference between monitoring based on pre-calculated envelopes and our approach. With pre-calculated envelopes, the envelopes remain constant over the complete monitoring process. In our approach, the envelopes may become smaller than the initial ones due to the refutation of inconsistent subspaces during monitoring. This results in an earlier detection of faults. However, subspace partitioning significantly increases the computational load.

Our approach is based on computing the envelopes of differential equations. For complex models, the overall runtime of our monitoring algorithm is dominated by solving the differential equations, especially when a high-precision method such as Runge-Kutta is used. The computational complexity of our algorithm for a single time-step can be estimated as

$$O\bigl(M\, 2^{K}\, (\rho + \mu)\bigr) \tag{17}$$

where $M$ is the number of partitions, $K$ is the number of uncertainty parameters, $\rho$ is the time of the Runge-Kutta algorithm, and $\mu$ is the time of the matrix multiplication according to Equation 11. The time $\rho$ strongly depends on the dynamic properties of the system, and for highly dynamic systems the assumption $\rho \gg \mu$ holds.
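
The exponential factor in this estimate comes from simulating every corner of each subspace model. A minimal sketch of the corner enumeration, assuming each subspace is an axis-aligned box given as (low, high) pairs, one pair per uncertain parameter; the function names are hypothetical, and the intervals are those quoted in Section 4.

```python
from itertools import product


def corners(intervals):
    """Return all corners of the box spanned by (low, high) intervals."""
    return list(product(*intervals))


def simulations_per_step(num_partitions, num_parameters):
    """Corner simulations needed per time step: M * 2**K."""
    return num_partitions * 2 ** num_parameters


# Parameter intervals for L1, L2, L12, C1, C2.
box = [(0.12, 0.13), (0.15, 0.18), (0.62, 0.73), (51, 54), (61, 65)]
all_corners = corners(box)  # 2**5 = 32 parameter tuples
```

For instance, 128 subspace models over five partitioned parameters would already require 128 * 32 = 4096 corner simulations per time step, which is why the cost of the differential equation solver dominates.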


Figure 4. First overview of the monotonicity of the technical system. Shown are the measured $T_2$ with its envelopes, the heating flag $H_2$ for the second component, and the number NM of non-monotonic subspace models. The discrete change of the input has a direct effect on the monotonicity of the state values.


This approach can also be seen as system identification, because refuting subspace models reduces the uncertainty space, resulting in smaller bounding intervals on the parameters. Measurement noise can also be handled by introducing additional uncertainty parameters into the model.

However,  this  approach  is  in  contrast  to  traditional  system  identi¬ 
fication  where  the  model  space  is  specified  by  a  parameterized  dif¬ 
ferential  equation.  Identification  selects  numerical  parameter  values 
so  that  simulation  of  the  model  best  matches  the  measurements.  By 
using  refutation  instead  of  search  our  method  is  able  to  derive  guar¬ 
anteed  bounds  on  the  trajectories. 

Model-based monitoring using uncertainty space partitioning is related to the interval identification algorithm of Schaich et al. [13]. In their approach the consistency check is only performed at the qualitative level. Thus, valuable detection time is lost as long as the fault is only manifested in a quantitative value. Petridis and Kehagias [10] have also developed an algorithm with subspace partitioning. The partitioning is only performed in advance, and the consistency check is based on probabilities depending on the noise in the system. Other work in monitoring [7, 9, 3] uses multiple models for fault detection. These models represent known faults of the supervised system. From the viewpoint of system identification, our approach is closely related to semi-quantitative system identification [8]. Both approaches are grounded on the refutation of subspace models that are known to be inconsistent with the measurements. Semi-quantitative system identification performs refinement at the qualitative and interval level. It has also been applied to model-based monitoring [11]. Bonarini and Bontempi [4] have developed an approach quite similar to our consistency check. However, they have focused on uncertain initial state values, which are given as intervals. Also related to our work is Armengol et al. [1, 2]. Their simulation is based on modal interval arithmetic, which produces overbounded and underbounded envelopes of a technical system. To minimize the rate of false and missed alarms, the uncertainty space is only partitioned at critical measurements (which lie between the underbounded and overbounded envelopes). In contrast, we simulate at each corner of the uncertainty space, which leads to exact envelopes (no false and missed alarms, according to observability) for linear systems.

Directions for future work include (i) the incorporation of (unknown) discontinuous transitions in our monitoring approach, (ii) further investigations on the monotonicity properties after a discontinuous transition, especially in the context of non-linear systems, and (iii) the improvement of the dynamic uncertainty space partitioning.
Abstract

The object-oriented paradigm is a new but proven technology for modelling mechatronics, i.e. multidisciplinary modelling. For many reasons the object-oriented approach is also very desirable for qualitative models in system design, diagnosis or verification. Bayesian networks are a very robust technology for qualitative probabilistic modelling. In this paper we present a first approach to using the Bayesian network modelling technique with the quantitative object-oriented method. Analogous to Modelica, an object-oriented modelling language, we constructed a Bayesian network library for modelling hydraulic systems. These Bayesian networks are called Object Oriented Dynamic Bayes Nets (OODBNs). Our method is easily transferable to any other physical domain or logic. In this contribution our motivation and the construction steps are described. Simulation results for a sample hydraulic system are given.

Introduction 

Future system architectures will be characterized by highly modular and reusable components, and by abstract description languages widely independent of implementation details. Typical components of system architectures are software and hardware (sub-)systems. On the software side the object-oriented paradigm is by now (at least in industrial applications) the de facto standard for description or modelling languages, mostly represented by the Unified Modelling Language (UML). On the hardware side, which is our focus here, we have mechatronic hardware components, the constituent parts of which are control logics and controlled physical or chemical systems. Modelling mechatronic systems challenges the engineer because different physical domains are involved. In order to reach the goal of a truly unified description of system architectures comprising software and hardware systems, the description or modelling languages of mechatronic systems have to be lifted to a similar abstract level as their software counterparts.

Model  based  techniques  play  an  important  role  in  con¬ 
current  and  future  engineering  processes.  Models  and  simu¬ 
lations  are  a  basis  for  system  design  and  analysis,  e.g.  for 
geometric  layout  of  hydraulic  systems.  On  the  other  hand, 
model  based  control  and  model  based  diagnosis  are  state  of 
the  art. 

Many different philosophies have been developed to support the modelling task. In the control engineering area tools like Matlab/Simulink [1] or MatrixX SystemBuild [2] are widespread. For modelling mechanical systems ADAMS [3] or SIMPACK [4] are frequently used. For electronic systems PSPICE [5] is an appropriate tool. Other specific tools are used to solve modelling tasks in flow dynamics, thermal flow or chemical processes. Each of these programs is specially tailored to its specific domain.

A mechatronic system consists of a control logic, electronics and a controlled mechanical, hydraulic or any other physical or chemical system. The entire system is composed of subsystems of different domains. This shows the restriction of all classical modelling systems: the control part can easily be described in, for example, Matlab/Simulink, but it is nearly impossible to model an electrical subsystem there. So, a method for multidisciplinary modelling is needed.

Methods  and  tools,  e.g.  Omola,  Dymola  or  Smile,  have 
been  developed  which  allow  multidisciplinary  modelling. 
Modelica  [6],  [7]  is  the  latest  step  in  this  direction.  It  is  a 
standardized  object-oriented  modelling  language  which  is 
supported  by  the  tool  Dymola  [8]  for  example. 

Dymola/Modelica comes with libraries for different physical domains like electrics/electronics, mechanics, thermal flow or hydraulics, see Figure 1. It also contains a signal block and a Petri net library. A library consists of a set of templates for different physical or logical objects. The user can extend a library, for example by inheritance, or can create completely new libraries. A model is described by an object diagram. Most tools contain a graphical interface with a simple drag and drop technique for the templates and interconnections at the object interfaces. The interconnections have the meaning of constraints. More precisely, two types of equations are generated when two physical objects are connected: a flow and a potential equation. With the definition of the flow and the potential variables, the energy flow in the interface is uniquely defined. This is valid for all lumped parameter systems. The great advantage is that the system can now be modelled by local behaviour and not by global analysis [9], which supports the general idea of modularity.

Qualitative  Models  and  Bayesian  Networks 

Qualitative modelling offers many well-known advantages for system design, diagnosis or verification, see [14] for a very extensive survey of techniques and applications. Some of these advantages are:

•  handling of incomplete and imprecise knowledge,

•  robustness,

•  easy comparison of system alternatives, e.g. parameter variations,

•  direct interpretation of simulation results,

•  reduced complexity.


Figure 1: The hydraulics library and an object diagram in Dymola.

Our vision is an object-oriented method using Bayesian networks for modelling physical systems, especially system dynamics. Bayesian networks are well suited to handling imprecise knowledge in a consistent way. Efficient learning and adaptation algorithms are known for Bayesian networks, which is a very interesting option for automatic model calibration. A Bayesian network BN is defined as BN = {DAG, CPDs}, where DAG is a directed acyclic graph consisting of nodes and directed edges or links, and CPDs are conditional probability distributions. The nodes in a Bayesian network represent propositional variables of interest (e.g., the temperature of a device). The links of a BN represent informational or causal dependencies among the variables. These dependencies are quantified by conditional probabilities (the CPDs) for each node given its parent nodes in the DAG. We do not repeat the fundamentals of Bayesian networks in this paper, but refer to the relevant literature, see [10] for some Bayesian network basics or [11] for an excellent textbook.
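
As a concrete reading of the definition BN = {DAG, CPDs}, the sketch below stores a two-node network as plain dictionaries and computes a posterior by enumeration. The node names, states and probability values are invented for illustration and do not come from the hydraulic library described in this paper.

```python
# DAG: each node maps to the tuple of its parent nodes.
dag = {"pump_ok": (), "pressure": ("pump_ok",)}

# CPDs: P(node = value | parent values). Every inner table sums to 1.
cpds = {
    "pump_ok": {(): {True: 0.9, False: 0.1}},
    "pressure": {
        (True,): {"low": 0.2, "high": 0.8},
        (False,): {"low": 0.7, "high": 0.3},
    },
}


def joint(assignment):
    """Joint probability of a full assignment: product of all CPD entries."""
    prob = 1.0
    for node, parents in dag.items():
        parent_values = tuple(assignment[p] for p in parents)
        prob *= cpds[node][parent_values][assignment[node]]
    return prob


def posterior_pump_ok(pressure):
    """P(pump_ok = True | pressure) by enumerating the hidden node."""
    p_true = joint({"pump_ok": True, "pressure": pressure})
    p_false = joint({"pump_ok": False, "pressure": pressure})
    return p_true / (p_true + p_false)
```

Entering the observation "low" lowers the belief in a working pump from the prior 0.9 to 0.18 / (0.18 + 0.07) = 0.72, which is the kind of propagation a tool such as HUGIN performs on much larger networks.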


Object-oriented  Bayesian  networks  were  introduced  in 
[13],  and  are  now  supported  by  the  newest  version  of  the 
commercial  Software  tool  HUGIN  [12]  for  example. 

Template  construction 

In this section we describe the conversion steps from Modelica to Bayesian network templates. The conversion proceeds in four major steps. First, given a dynamic component, the differential equations are discretized in time using Euler's rule. Second, the equation part of a Modelica template is reformulated with qualitative operators. Third, the qualitative landmarks have to be chosen for each state variable and each parameter. Fourth, the resulting qualitative equations are graphically programmed with Bayesian networks.

A fuel reservoir called "VolumeConst" will serve as an example. The icon used in the Modelica HyLib library [15] is shown in Figure 2. Note that the component VolumeConst has one port (portA) and that the flow into the component has a positive sign. PortA can be viewed as a real physical flange with some pressure p and an oil flow q. The behavior of the component VolumeConst is described in Modelica by the equation block. Other definition blocks like the graphical, interfaces or parameter block are omitted.


Figure  2:  The  Dymola  representation  for  a  fuel  reservoir. 


model VolumeConst
  // graphical block (omitted)
  // interfaces block (omitted)
  // parameter block (omitted)
equation
  der(portA.p) = beta/volume * portA.q;
end VolumeConst;

The equation block consists of one differential equation, with beta and volume being fixed parameters defining the effective bulk modulus of the liquid and the volume in cubic meters, respectively. After choosing a time step h, the time-discrete version is as follows:

model VolumeConstDiscr
equation
  1/h * (portA.p(t) - portA.p(t-h)) =
    beta/volume * portA.q(t);
end VolumeConstDiscr;
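
Solved for the current pressure, the discretized equation reads p(t) = p(t-h) + h * beta/volume * q(t). A minimal numerical sketch of this update follows; the function names and the parameter values in the usage line are hypothetical and carry no physical meaning.

```python
def pressure_step(p_prev, q_now, beta, volume, h):
    """One time step of the discretized VolumeConst equation.

    p(t) = p(t-h) + h * beta / volume * q(t)
    """
    return p_prev + h * beta / volume * q_now


def simulate_pressure(p0, flows, beta, volume, h):
    """Apply the update for a sequence of flows; returns all pressures."""
    pressures = [p0]
    for q_now in flows:
        pressures.append(pressure_step(pressures[-1], q_now, beta, volume, h))
    return pressures


# Constant inflow: the pressure rises by h * beta / volume * q per step.
trace = simulate_pressure(0.0, [2.0, 2.0, 2.0], beta=10.0, volume=5.0, h=0.1)
```

The qualitative version discussed next keeps the structure of this update but replaces the numbers by signs.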

Next, qualitative operators are inserted.

model VolumeConstQual
equation
  portA.p(t) ⊕ -portA.p(t-h) =
    const ⊗ portA.q(t);
end VolumeConstQual;

Now we have to choose a quantity space, the "landmarks", for the variables and parameters. For clarity, we choose a three-valued quantity space x ∈ {-, 0, +} for all variables x. Some qualitative calculus has to be defined for the chosen quantity space. Qualitative addition ⊕ for the three-valued quantity space can be defined straightforwardly [14] as in Table 1.

Table 1: Qualitative addition defined.

The z = ? entry marks the ambiguity of the result when the ⊕ operator is applied to x = - and y = +, or vice versa.
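
The qualitative calculus can be sketched in a few lines. The code below assumes the usual sign algebra for the three-valued quantity space, which agrees with the ambiguity described above; the multiplication and negation rules, the function names, and the convention that an ambiguous left-hand side is consistent with any right-hand side are assumptions made for illustration.

```python
AMBIGUOUS = "?"


def q_neg(x):
    """Qualitative negation: swaps '-' and '+', keeps '0'."""
    return {"-": "+", "0": "0", "+": "-"}[x]


def q_add(x, y):
    """Qualitative addition; opposite signs give an ambiguous result."""
    if x == "0":
        return y
    if y == "0":
        return x
    return x if x == y else AMBIGUOUS


def q_mul(x, y):
    """Qualitative multiplication following the rule of signs."""
    if x == "0" or y == "0":
        return "0"
    return "+" if x == y else "-"


def volume_const_consistent(p_now, p_prev, q_now, const="+"):
    """Check p(t) (+) -p(t-h) = const (x) q(t) on qualitative values."""
    lhs = q_add(p_now, q_neg(p_prev))
    rhs = q_mul(const, q_now)
    return lhs == AMBIGUOUS or lhs == rhs
```

For example, a pressure that moves from 0 to + is consistent with a positive inflow but not with a negative one, while two positive pressures give an ambiguous difference and therefore rule nothing out.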

Now the Bayesian network template for VolumeConst can be constructed. The basic idea is to identify each qualitative variable with a Bayesian network node, the qualitative values with the states of this node, and the qualitative calculus with CPDs.

We give an example for the ⊕ operator applied to variables x and y. The basic Bayesian network fragment is shown in Figure 3. The entries in the CPD table in Figure 3 are probabilities, where each column sums to 1. The z = ? entries in the operator table can be represented by columns with the uniform distribution, i.e. 1/3 for each entry in this case.
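
The translation from an operator table to a CPD can be sketched as follows: a deterministic result becomes a column with a single 1, and an ambiguous result becomes a uniform column. The dictionary layout and function names are assumptions for illustration; they do not reproduce the table format of any particular Bayesian network tool.

```python
VALUES = ("-", "0", "+")


def qual_add(x, y):
    """Qualitative addition; returns None when the result is ambiguous."""
    if x == "0":
        return y
    if y == "0":
        return x
    return x if x == y else None


def cpd_column(x, y):
    """P(z | x, y) for the node z = x (+) y as a dict over VALUES."""
    z = qual_add(x, y)
    if z is None:
        # Ambiguous entry: uniform distribution over all states of z.
        return {v: 1.0 / len(VALUES) for v in VALUES}
    # Deterministic entry: all probability mass on the single result.
    return {v: (1.0 if v == z else 0.0) for v in VALUES}


# One column per combination of parent states (x, y).
cpd = {(x, y): cpd_column(x, y) for x in VALUES for y in VALUES}
```

The resulting table has nine columns, two of which (for the parent states (-, +) and (+, -)) are uniform.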

Any other algebraic operation can also be reformulated as a Bayesian network fragment. In this way the complete template for VolumeConst is constructed. The result is shown in Figure 4. The port nodes, which correspond to the port variables in portA, are marked with a rectangle. When the Bayesian network template is instantiated in a system model, only the input and output nodes, i.e. the port nodes, are visible. Note that this is a dynamic Bayesian network, because node PA0 carries the state of the pressure at time slice t-h and PA1 the state at time slice t. The difference is calculated in node dP01.


Figure 3: Bayesian network fragment and the CPD for the ⊕ operator.


Figure  4:  Bayesian  network  template  for  VolumeConst. 


What is still missing are the constraint templates, which serve as connectors between components. We need two different templates: one expressing that the flows sum up to zero at a hydraulic node, and a second one expressing that there is equal pressure at two connected ports. We present these two templates with their CPDs in Figure 5 and Figure 6, respectively. For the pressure we assume three values: zero pressure (0), low pressure (+), and high pressure (++). Note that the arcs are directed to the "inner" constraint node, such that the resulting Bayesian network model is always acyclic. In Figure 5 and Figure 6 we show the simplest scenario, in which two components are connected in series. In Figure 1, on the right hand side, this is the case for the "ReliefValve" and the "LineToFilterResistance". In the Bayesian network for this tank system, the ReliefValve template and the LineToFilterResistance template are connected via the flow constraint and the pressure constraint, see also Figure 7. Note that the table entries are hard 0/1 decisions. Before propagating the Bayesian network, the "true" state of the sum_0 and the eq_p nodes must always be entered as evidence. Doing this, the pressures on both sides are forced to be equal. The flow constraint then simply states that the mass flow coming out of the first component equals the mass flow into the second component. In the general case, where more than two components meet, for example the "LineToFilterResistance", the "FilterResistance" and the "ReliefValveFilter", the flow constraint template must be assembled from the ⊕-operation fragment, see Figure 3, and the flow constraint template of Figure 5. This new object is then called ZeroSumFlows3 and is shown in Figure 7.

Figure 5: Bayesian network template ZeroSumFlows2 for the flow constraint.

Figure 6: Bayesian network template EqPressure2 for the pressure constraint.
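
The effect of entering the "true" state of a constraint node as evidence can be illustrated for the pressure constraint. In the sketch below the eq_p node has a hard 0/1 CPD, and conditioning on eq_p = true leaves only the assignments in which both port pressures agree. The prior beliefs and function names are hypothetical; only the node name eq_p and the three pressure values come from the text.

```python
PRESSURES = ("0", "+", "++")


def eq_p_cpd(p_a, p_b):
    """Hard 0/1 CPD: P(eq_p = true | p_a, p_b)."""
    return 1.0 if p_a == p_b else 0.0


def posterior_given_equal(prior_a, prior_b):
    """Belief over the common port pressure after setting eq_p = true.

    prior_a and prior_b are the beliefs over the two port pressures
    before the constraint is applied (dicts over PRESSURES).
    """
    weights = {}
    for value in PRESSURES:
        weights[value] = sum(
            prior_a[value] * prior_b[p_b] * eq_p_cpd(value, p_b)
            for p_b in PRESSURES
        )
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("evidence eq_p = true contradicts both priors")
    return {value: w / total for value, w in weights.items()}


belief = posterior_given_equal(
    {"0": 0.5, "+": 0.5, "++": 0.0},
    {"0": 0.0, "+": 0.5, "++": 0.5},
)
```

Because the only pressure value that both sides consider possible is "+", the evidence concentrates all belief on that state.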

Results 

A basic library for constructing simple hydraulic circuits has been developed. It contains an ideal flow source, a reservoir, a hydraulic resistance, a tank, a relief valve, a real flow source, and the constraint templates. Differing from the previously discussed three-state nodes, each dynamic variable here has five states. We used the object-oriented Bayesian network software Hugin. Currently, only discrete-valued nodes are used. This is reasonable, because many mechatronic systems are hybrid or switching. Discrete-valued nodes allow us to model arbitrary dynamics, whereas continuous-valued node models result in Kalman models, and thus linear models.

We will briefly discuss the hydraulic library. The ideal flow source has two ports, namely A and B, or a "positive" and a "negative" port, with only one flow variable which can be controlled, i.e. entered as evidence. A real flow source called RealFuelPump is derived from this ideal flow source. Additionally it contains the volume model described above. So port B of RealFuelPump delivers a pump flow and a pump pressure. The tank model has only a flow variable at the ports A and B. It is dynamic, modelling the change of the fuel volume over time.

The relief valve has a switching behaviour in Modelica. Pressure and flow are specified at the ports. In Modelica the valve logic is modelled with a state machine. We modelled this valve logic with a Markov model having the two states "open" and "closed". Finally, the hydraulic resistance has two ports specifying pressure and flow. It models laminar flow, i.e. the pressure drop over a hydraulic line. We also used this element to model the resistance of the fuel filter.


Figure  7:  The  Bayesian  network  tank  system  model. 


We present a small hydraulic circuit, namely a fictitious tank system. This system was first modelled for reference in Dymola/Modelica, see Figure 1 on the right hand side. Then we built this system using the dynamic Bayesian network templates. The Bayesian network tank system is shown in Figure 7. For the dynamic simulation, we entered all constraint nodes and the pump flow as evidence. All other nodes are hidden, that is, they were calculated by propagation. The results for 100 time steps are shown in Figure 8, Figure 9, and Figure 10. These figures show the evolution of the probability distributions. The darker the colour bars are, the higher the probability. We added mean values for convenience. The plots were produced using the Qualitative Modelling Toolbox for Matlab SIMULINK [16].

Figure 8: The pump flow (evident node).

Figure 9: The flow into the fuel reservoir VolumeConst (inside the RealFuelPump template - hidden node).


Conclusion  and  future  work 

In this contribution we have motivated the need for intelligent modelling techniques. For system design, diagnosis or verification, qualitative models are a very good choice. We favor the Bayesian network technology due to its robustness, intuitiveness and practicability.

We sought a qualitative modelling technique for mechatronic systems, i.e. for dynamic, multidomain systems. The object-oriented physical modelling technique gave us the hint for the construction of our OODBNs. The simulation results encourage us to proceed in this direction. Recently, the states (the "landmarks") have been selected successfully from measurements or quantitative simulations, using a simple heuristic from system identification. Furthermore, learning or adapting the CPDs using HUGIN's adaptation API was very promising.
Abstract

This paper introduces a concept for building distributed monitoring and diagnostic systems for complex industrial applications. The diagnostic process, from accessing sensor data up to the visualization within a graphical user interface, is described by universally applicable formalisms. Generic mechanisms were identified to improve the quality of a diagnosis by integrating legacy diagnostic engines and handling different diagnostic mechanisms in parallel. For this purpose, a modular multi-agent architecture and a set of development tools were implemented.

This software architecture for monitoring and diagnosis was developed within the framework of the EU Esprit Program: “DIAMOND: Distributed Architecture for MONitoring and Diagnosis”.

1  Introduction 

To compete within industry, manufacturers are required
to optimize their production processes. To achieve this
efficiency, great importance has to be attached to the
quality of the industrial equipment as well as of the
industrial process itself. Monitoring and diagnostic systems
(M&D systems) support this objective by predicting failures
or, if a failure has occurred, by identifying the reason
for the fault. This makes it possible to reduce the down-time
costs of the production process. To achieve an overall
reduction of the production costs, the development expense
for a powerful M&D system has to be minimized.
For this purpose, a modular concept was realized that is
based on a distributed multi-agent approach.

A complex industrial application is built from a set of
different physical units. These units may be provided by
different vendors, each having detailed knowledge about the
behavior of its unit. Furthermore, there are often different
diagnostic methods available for the same unit. To
realize a powerful diagnosis, the knowledge about the
different physical units and about the various diagnostic
methods should be merged within an overall
framework. Therefore, new strategies were required to
handle this complementary diagnostic knowledge about an
industrial process.

This paper presents the generic aspects of the underlying
infrastructure and describes the multi-agent framework.
It is intended neither to explain any diagnostic algorithms
that may be applied nor to present in detail the
methodologies that are employed to handle different
diagnostic results in parallel.

2  Related  work 

Research interest in distributed approaches for
diagnostic purposes can currently be seen in Europe,
Japan and the United States. A general overview of
distributed artificial intelligence in industry is given in
[Par94]. That paper reviews the industrial needs for
Distributed Artificial Intelligence, giving special attention to
systems for manufacturing, scheduling and control. It
gives case studies of several advanced research applications
and actual industrial installations, and identifies steps
that need to be taken to deploy these technologies more
broadly.

In  [Fro96]  there  is  a  distinction  between  semantically 
distributed  diagnosis  and  spatially  distributed  diagnosis. 
Semantically  distributed  diagnosis  refers  to  a  heteroge¬ 
neous  group  of  agents,  in  which  each  agent  has  its  own 
view  of  the  system.  This  can  either  mean  that  each  agent 
focuses  on  a  specific  area  of  the  system  or  that  the  agents 
model  different  aspects  of  the  system  or  use  different 
diagnostic  methods,  e.g.  one  agent  models  the  structure 
of  the  system  and  another  one  models  the  performance. 
Spatially distributed diagnosis refers to a group of agents
which jointly monitor and diagnose a spatially distributed
system whose pieces of equipment are related to each
other. Each agent has detailed knowledge about a small
part of the system.

Different concepts for realizing a distributed approach are
proposed in the literature, ranging from classical
client-server applications through blackboard technologies to a few
multi-agent frameworks. Most distributed applications
employ a classical client-server approach with distributed
clients communicating with a central server. This
well-known technology is continuously being advanced and applied
to distributed management applications. The Common
Diagnostic Model (CDM), developed within the
“Distributed Management Task Force” (DMTF) [DMTF],
aims at standardizing distributed management tasks such as
information exchange, monitoring and diagnostics.
This framework is based on software clients
which perform their tasks nearly automatically and report
information to a central server.

Another  concept  is  based  on  a  blackboard  technology.  A 
central  blackboard  is  used  to  store  all  available  informa¬ 
tion  about  an  industrial  process.  These  data  can  be  ac¬ 
cessed  by  other  software  units,  possibly  software  agents, 
in  order  to  perform  a  diagnosis  or  to  visualize  the  results. 
[Lee97]  outlines  the  conceptual  foundations  for  next 
generation  industrial  remote  diagnostics  and  product 
monitoring  systems.  It  extends  the  multi-agent  frame¬ 
work  research  to  include  new  classes  of  product  popula¬ 
tion  and  diagnostic  agents  within  a  distributed  Embedded 
Web  and  Electronic  Commerce  infrastructure.  The  prod¬ 
ucts  to  be  diagnosed  are  for  example  printers,  copiers 
and  vehicles. 

[Ben97] introduces a KQML-CORBA (Knowledge
Query and Manipulation Language) [KQML] based
architecture to implement a multi-agent system for network
and service management. This paper adopts that
architecture and applies it to the diagnosis and monitoring of
machines and components in the production environment by
using a multi-agent approach combined with semantically
distributed diagnosis. Up to now there are only a few
other implementations of KQML: one over TCP/IP in C,
which was developed by Lockheed-Martin [Fin99], and
one in JAVA [Fro96]. The use of the agent communication
language developed by FIPA [FIPA98] together with
the CORBA standard for distributed computing is a
widely used strategy to realize agent interaction
[Orf97].

3  Units  of  Monitoring  and  Diagnostic 
System 

High-value and cost-effective diagnosis systems can be
developed by distinguishing between the tasks that have
to be performed within an M&D system. Some tasks are
general for all monitoring and diagnostic systems, some
are specific to the employed production equipment and
others are specific to the considered production process.
These three kinds of units should be treated in different ways:

3.1  Generic  M&D  units 

Independent of the specific requirements of an application
area, a set of tasks is generic for all monitoring and
diagnostic systems. Functionalities such as interfacing
different units, storing data, formatting measurements and
diagnosis results, configuring the system, managing the
interaction and some more have to be performed in all
M&D systems. It is therefore feasible to reuse a set of
existing software units whenever a specific monitoring
and diagnostic system has to be built. This allows
highly developed and well-tested units to be integrated
quickly and at low cost. Together with a set of
well-defined interfaces, these units form a general architecture
for monitoring and diagnosis.

3.2  Specific  to  production  equipment 

Some parts of monitoring and diagnostic software are
specific to the production equipment used. How sensor
values are accessed and how these data are used to
diagnose a physical component is specific to the production
environment. The component manufacturer is interested
in equipping its component with monitoring and diagnostic
capabilities that are compatible with a generic
architecture. The use of software agents makes it possible to
encapsulate legacy diagnostic tools so that they become
interoperable with the overall system (see section 5). These
parts are reusable whenever the same production equipment
is used and should not be redeveloped from scratch each
time an M&D system is built.

3.3  Specific  to  production  process 

All remaining parts of the complete monitoring and
diagnosis system are specific to the production process.
The interaction between different physical components
and adaptations of the operator interface are examples of
units that have to be developed individually.

4  DIAMOND  -  distributed  multi¬ 
agent  architecture 

The architecture proposed in this paper tries to merge
the three different parts into a consistent monitoring and
diagnosis system. All generic units of an M&D system are
realized by software agents, each responsible for a
specific task. The integration of all parts
that are specific to the production environment or to the
production process itself is supported by encapsulating
these functionalities into different agents that are able to
interact with the overall framework. The DIAMOND architecture
specifies the information that is required to integrate
these parts and supports the development process by
offering a set of tools (see section 5).

The structure and the interaction of the software agents
that perform all generic tasks and that are
used to encapsulate all specific tasks are described in the
following sections.

4.1  Structure  of  DIAMOND  Agent 

All  agents  that  are  used  within  DIAMOND  are  built  of 
two  different  software  units.  One  is  responsible  for  the 
communication  with  other  DIAMOND  agents  within  the 
M&D  framework.  This  unit  is  called  ‘Wrapper’  and  is 
identical  for  all  agents.  The  second  software  unit  is 
called  ‘Brain’.  This  part  performs  all  tasks  that  are  spe¬ 
cific  for  the  type  of  the  agent. 

The  Wrapper  is  responsible  for  handling  the  communi¬ 
cation  of  this  agent  with  other  agents  in  the  M&D  system 
by  applying  the  CORBA  middleware.  Information  can  be 
exchanged  by  applying  CORBA  events  or  by  using 


CORBA  call-backs.  The  wrapper  registers  its  CORBA 
objects  at  the  ORB  to  become  visible  for  other  agents. 
However,  the  wrapper  does  not  initiate  or  analyze  any 
information  exchange  with  another  agent  by  itself.  This 
is  handled  by  the  brain  part. 

The agent's brain performs all tasks that are specific to
the type of the agent. Some of the agents that are utilized
within the framework perform exclusively generic tasks
that are similar for all industrial environments. The
implementation of the brain for these agents is the same for
each application where DIAMOND is applied. The tasks
that have to be performed by the monitoring and diagnostic
agents (see sections 4.3 and 4.4) are only partly independent
of the application. One part handles generic monitoring
or diagnostic capabilities. This part of the monitoring
agent brain and the diagnostic agent brain is provided by
the DIAMOND development toolkit (see section 5.4 and
section 5.5). All further capabilities depend on the
specific application and have to be implemented afterwards
individually by accessing well-defined interfaces. Because
this part is not implemented in advance, legacy monitoring
and diagnostic tools can be integrated inside
the brain of a monitoring agent or of a diagnostic
agent.
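
The split between Wrapper and Brain can be illustrated with a minimal Python sketch. It is not the DIAMOND implementation: the class and method names (`Brain.handle`, `Wrapper.receive`, `Wrapper.step`, `EchoBrain`) are assumptions made for illustration, and a plain in-process queue stands in for the CORBA middleware.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Optional


class Brain(ABC):
    """Agent-type-specific logic (hypothetical interface)."""

    @abstractmethod
    def handle(self, content: str) -> str:
        """Process the content of one message and return the answer."""


class Wrapper:
    """Communication part that is identical for every agent.

    The real Wrapper uses CORBA; here an in-process queue is used
    as a stand-in for the middleware.
    """

    def __init__(self, brain: Brain) -> None:
        self.brain = brain
        self.inbox = deque()

    def receive(self, content: str) -> None:
        # The Wrapper only stores the message; it never interprets it.
        self.inbox.append(content)

    def step(self) -> Optional[str]:
        # The brain takes the oldest message once it is able to process it.
        if not self.inbox:
            return None
        return self.brain.handle(self.inbox.popleft())


class EchoBrain(Brain):
    """Trivial brain used only to exercise the sketch."""

    def handle(self, content: str) -> str:
        return f"processed:{content}"
```

Under these assumptions, a new agent type only needs a new `Brain` subclass; the `Wrapper` stays untouched, which mirrors the reuse argument made above.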

4.2  Agent  interaction 

The  Wrapper  of  each  agent  uses  CORBA  to  exchange 
messages  with  other  agents.  It  supports  a  synchronous 
call-back  communication  and  an  asynchronous  event- 
based  communication. 

The wrapper exposes a CORBA remote interface that other
agents can use whenever they want to send a message
to this agent by means of a CORBA call-back. The
FIPA Agent Communication Language (FIPA-ACL)
[FIPA98] is used to restrict the interaction between
communicating agents (see figure 1). The REQUEST, the
QUERY-REF, the SUBSCRIBE and the CANCEL protocols
were identified to cover all required agent interactions.
The CORBA call-back concept allows the receiving
agent to return an integer value instantly to indicate that
the structure of the message is invalid, that the message is
not secure, that an unknown protocol was received, that the
agent is too busy to handle the message, or that the agent
will continue to process the message. Only in the last case
is the message stored in an internal buffer that is part of
the Wrapper. This buffer allows incoming (and
outgoing) messages to be sorted according to their priority,
their timestamp or a given time-out. The brain
removes the message from the buffer as soon as it is able
to process it. After processing the message, it specifies
the answer corresponding to the protocol used. This
answer is stored in the internal buffer of the agent and
forwarded to the remote CORBA object.

The message that is sent by an agent contains a set of
pre-defined parameters as specified by FIPA. These
parameters are stored within an XML structure. Since a
conversation must handle not only the exchange of information
but also the exchange of attitudes about the information, a
two-layered protocol is applied. The outer layer of a
message represents the attitude about the information. These
data are processed by the Wrapper. The information
itself is part of an inner layer which is stored in the
content parameter of a message. The message content, which
is also encoded in XML syntax, is processed by the
brain of the agent.

If an agent wants to supply data quickly to the overall
multi-agent framework without regard to the receiving
agents, an asynchronous event-based communication
is more suitable. This mechanism is mainly used
by the monitoring agents to supply measurements to all
diagnostic agents that are interested in them. Every agent is
able to supply and to consume events, structured in
XML, by connecting to different event channels. This
CORBA functionality is accessed by the Wrapper.

Figure 1 Agent Interaction
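
The call-back behaviour and the Wrapper's buffer might be sketched as follows. This is only an illustration under stated assumptions: the numeric status codes, the XML layout (a `message` element with `protocol`, `priority` and `timestamp` attributes and a `content` child) and the capacity-based "busy" rule are all hypothetical, and the security check is left out.

```python
import heapq
import itertools
import xml.etree.ElementTree as ET
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical integer codes returned instantly by the call-back."""
    ACCEPTED = 0
    INVALID_STRUCTURE = 1
    NOT_SECURE = 2          # listed for completeness, not checked below
    UNKNOWN_PROTOCOL = 3
    BUSY = 4


# The four protocols named in the text.
KNOWN_PROTOCOLS = {"REQUEST", "QUERY-REF", "SUBSCRIBE", "CANCEL"}


class MessageBuffer:
    """Wrapper-side buffer ordered by priority, then by timestamp."""

    def __init__(self, capacity=100):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal keys
        self.capacity = capacity

    def deliver(self, raw_xml):
        """Inspect the outer layer only and return a status code at once."""
        try:
            msg = ET.fromstring(raw_xml)
        except ET.ParseError:
            return Status.INVALID_STRUCTURE
        content = msg.find("content")
        if msg.tag != "message" or content is None:
            return Status.INVALID_STRUCTURE
        if msg.get("protocol") not in KNOWN_PROTOCOLS:
            return Status.UNKNOWN_PROTOCOL
        if len(self._heap) >= self.capacity:
            return Status.BUSY
        try:
            key = (-int(msg.get("priority", "0")),
                   float(msg.get("timestamp", "0")))
        except ValueError:
            return Status.INVALID_STRUCTURE
        # Only accepted messages are stored for the brain.
        heapq.heappush(self._heap, (key, next(self._counter), content))
        return Status.ACCEPTED

    def next_content(self):
        """Hand the inner layer of the most urgent message to the brain."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

In this sketch the Wrapper never looks inside the `content` element; it is passed on unparsed to whatever plays the role of the brain, which reflects the two-layered protocol described above.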


4.3  Monitoring  Agent 

The interface between the physical state of the industrial
application and the DIAMOND system is realized by a
Monitoring Agent. This type of agent handles the
measurements of the physical components and prepares them
for processing by other agents within the framework.
Each  Monitoring  Agent  has  to  be  adjusted  to  the  sensors 


of  the  industrial  equipment  that  will  provide  the  meas¬ 
urements.  Furthermore,  the  Monitoring  Agents  are  able 
to  initiate  a  diagnosis  of  a  component  as  soon  as  they 
have  identified  an  irregular  state  of  a  measurement. 
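
A Monitoring Agent that forwards every measurement and requests a diagnosis on an irregular value could look roughly like this. The sketch assumes that "irregular" means "outside a configured range"; the names `Sample`, `limits`, `publish` and `request_diagnosis` are hypothetical and do not come from the DIAMOND toolkit.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    name: str
    value: float


class MonitoringAgent:
    """Publishes measurements and triggers a diagnosis when needed."""

    def __init__(self, limits, publish, request_diagnosis):
        self.limits = limits                        # {name: (low, high)}
        self.publish = publish                      # stands in for an event channel
        self.request_diagnosis = request_diagnosis  # stands in for a REQUEST message

    def on_sample(self, sample):
        # Every measurement is made available to the framework.
        self.publish(sample)
        low, high = self.limits[sample.name]
        # Assumed rule: a value outside its range is an irregular state.
        if not low <= sample.value <= high:
            self.request_diagnosis(sample.name)
```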

4.4  Diagnostic  Agent 

Different aspects of distribution are handled within the
DIAMOND framework. First of all, the different tasks
that have to be performed within the monitoring and
diagnostic process are distributed to different agent types,
each responsible for its specific task. The task that has to
be performed by the Diagnostic Agents is in turn
distributed. The Diagnostic Agents process the
measurements that are provided by the Monitoring
Agents to identify the functional state of the physical
components. This diagnosis may be performed by different
Diagnostic Agents, each having a different view of
the industrial application.

•  This variation may be related to different temporal
aspects of the behavior of the plant (temporal
distribution).

•  Often,  there  are  different  diagnostic  algorithms 
available  to  identify  the  state  of  an  industrial  proc¬ 
ess.  A  development  tool  for  a  flexible  M&D  system 
has  to  be  able  to  handle  various  diagnostic  mecha¬ 
nisms  in  parallel.  This  is  identified  as  a  semantical 
distribution  of  the  diagnosis. 

•  The  entire  diagnostic  knowledge  about  the  behavior 
of  the  plant  is  split  to  a  set  of  smaller  knowledge 
units,  each  associated  with  a  physical  part  of  the 
plant,  called  component.  A  single  Diagnostic  Agent 
does  not  know  about  the  behavior  of  the  complete 
plant,  but  about  a  single  component.  This  knowledge 
may  be  provided  by  the  manufacturer  of  the  compo¬ 
nent.  In  this  manner,  the  diagnostic  task  is  spatially 
distributed. 

Distributing the overall diagnostic task with regard to
temporal, semantical and spatial aspects makes a flexible and
clear framework feasible. To diagnose the overall
process, the various diagnostic results reported by different
Diagnostic Agents have to be merged.
This additional task is performed by the Conflict
Resolution Agent.

4.5  Conflict  Resolution  Agent 

A conflict resolution mechanism is required to investigate
whether the diagnostic results reported by different
Diagnostic Agents contradict or complement each
other. The Diagnostic Agents do not communicate with
each other to merge their knowledge, but report their
diagnoses to a Conflict Resolution Agent. According to
the different types of distribution, temporal, semantical
and spatial conflicts have to be considered. For this
purpose, the relations between the components and between
the possible failures of these components have to be
well known (sections 5.1 and 5.3).
This knowledge is represented by a graph. An adjacency
vector, in which each element represents a component, is
used to build the graph [ALG94]. Each node (component)
consists of:

•  Vector  of  topological  arcs  with  other  components 

•  Vector  of  relationships  between  the  same  failure  in 
different  components 

•  Vector  of  relationships  between  different  failures  in 
different  components. 
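
The node layout listed above can be written down as a small data structure. The following Python sketch is an assumption about how such a graph might be held in memory; the field names and the example failure names are invented, and only the three kinds of vectors come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class ComponentNode:
    """One element of the adjacency vector (field names are assumptions)."""
    name: str
    # Topological arcs to other components.
    arcs: List[str] = field(default_factory=list)
    # (failure, other component): the same failure in another component.
    same_failure: List[Tuple[str, str]] = field(default_factory=list)
    # (own failure, other component, other failure).
    cross_failure: List[Tuple[str, str, str]] = field(default_factory=list)


def build_graph(nodes: Iterable[ComponentNode]) -> Dict[str, ComponentNode]:
    """Index the nodes by name and check that every reference resolves."""
    graph = {node.name: node for node in nodes}
    for node in graph.values():
        targets = list(node.arcs)
        targets += [other for _, other in node.same_failure]
        targets += [other for _, other, _ in node.cross_failure]
        for target in targets:
            if target not in graph:
                raise KeyError(f"{node.name} refers to unknown component {target}")
    return graph


# Illustrative components loosely modelled on a welding cell.
graph = build_graph([
    ComponentNode("controller", arcs=["robot"]),
    ComponentNode("robot", arcs=["controller", "positioner"],
                  cross_failure=[("wire_jam", "positioner", "misalignment")]),
    ComponentNode("positioner", arcs=["robot"]),
])
```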

The  overall  conflict  resolution  process  is  divided  into 
different  sequential  steps: 

•  The reported failures are assigned to the nodes
(components) of the graph that is built from the
information specified in the structural knowledge base
(section 5.1). This allows semantical conflicts to be
identified.

•  Next, spatial and temporal conflicts are investigated
at three different levels. Level 1 works with
the topological information specified in the structural
knowledge base, level 2 works with the relations
between the same failure in different components
(i.e. similar to level 1 but specific to a failure)
and level 3 works with the relations between different
failures in different components.

Details about these generic algorithms for handling
temporal, semantical and spatial conflicts can be found in
[DIAMOND].
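
The first step, assigning reported failures to components, can be illustrated with a deliberately simple sketch. It assumes that a semantical conflict exists whenever different Diagnostic Agents report different results for the same component; this reading, the function name and the report format are assumptions, and the actual algorithms are those documented in [DIAMOND].

```python
from collections import defaultdict


def semantic_conflicts(reports):
    """Group reports by component and keep the components with disagreement.

    reports: iterable of (agent, component, failure) tuples.
    Returns {component: {failure: [agents]}}.
    """
    by_component = defaultdict(lambda: defaultdict(list))
    for agent, component, failure in reports:
        by_component[component][failure].append(agent)
    return {
        component: dict(failures)
        for component, failures in by_component.items()
        if len(failures) > 1  # more than one distinct result reported
    }
```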

4.6  Facilitator  Agent 

The Facilitator Agent is responsible for networking and
mediating between the agents in the multi-agent
framework. Large industrial applications may be structured
federally and hierarchically. This structure is mapped onto
different “domains”. A domain is a subsystem of the
DIAMOND architecture that is responsible for a part of
the industrial application. Each “domain” is associated
with a Facilitator Agent that facilitates the networking within
this domain and with the Facilitator Agents of other
domains. Thus a diagnosis of a single domain as well as
a diagnosis of the complete industrial application is
feasible. Furthermore, the Facilitator Agent is the mediator
to the Graphical User Interface Agent.

4.7  Blackboard  Agent 

All diagnosis results that were reported within a
well-defined timeframe are stored in a blackboard that is
implemented in a Blackboard Agent. Each domain has its
own Blackboard Agent, which mediates with the
Conflict Resolution Agent. The Blackboard Agent provides
the results reported by the Diagnostic Agents and
triggers the conflict resolution process. The resolved
diagnostic results, which cover the state of all components that
are part of the domain, are forwarded to the Facilitator
Agent. The Blackboard Agent is also in charge of storing
all reported diagnostic results permanently.
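
The Blackboard Agent's role of collecting results per timeframe, triggering conflict resolution and archiving everything might be sketched as below. The windowing rule (a batch is closed when a report arrives at least `window` time units after the first report of the batch) is an assumption, as are all names; `resolve` stands in for the Conflict Resolution Agent and `forwarded` for the hand-over to the Facilitator Agent.

```python
class Blackboard:
    """Collects diagnostic results per timeframe (illustrative sketch)."""

    def __init__(self, window, resolve):
        self.window = window        # length of one timeframe
        self.resolve = resolve      # stand-in for the Conflict Resolution Agent
        self.archive = []           # permanent store of all reported results
        self.forwarded = []         # resolved results for the Facilitator Agent
        self._pending = []
        self._window_start = None

    def report(self, timestamp, result):
        self.archive.append((timestamp, result))
        if self._window_start is None:
            self._window_start = timestamp
        # Close the current timeframe before accepting a later report.
        if timestamp - self._window_start >= self.window:
            self.flush()
            self._window_start = timestamp
        self._pending.append(result)

    def flush(self):
        """Trigger conflict resolution for the results collected so far."""
        if self._pending:
            batch, self._pending = self._pending, []
            self.forwarded.append(self.resolve(batch))
```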

4.8  Graphical  User  Interface  Agent 

The  Graphical  User  Interface  Agent  is  the  human  gate¬ 
way  to  the  DIAMOND  system.  The  operator  uses  this 


interface  to  get  information  about  the  state  of  the  indus¬ 
trial  application,  to  provide  human  accessible  informa¬ 
tion  to  the  Diagnostic  Agent  and  to  initiate  diagnostic 
processes. 

4.9  Overall  Architecture 

The hierarchical and federal structure of the industrial
environment that has to be monitored and diagnosed is
transferred to a hierarchical and federal structure of the
software architecture. For this purpose, industrial
components are grouped together to form a logical superior
unit. The agents that are responsible for diagnosing
this set of components are grouped within a “domain”.
Only the Facilitator Agent of each domain is able to
communicate with the Facilitator Agents of other domains or
with the Graphical User Interface Agent.

The main concepts of the DIAMOND architecture are
summarized in figure 2.

Figure 2 DIAMOND Architecture

All DIAMOND agents that are interacting are pictured as
two colored boxes with the type of the agent written
inside. The light green box indicates the Wrapper that is
responsible for communication. This part is identical for
all agents. The second box represents the Brain, which is
specific to the agent's type. The agents interact by
using the Object Request Broker (CORBA). This
middleware is pictured as the yellow bar in the middle of the
figure.

The figure indicates three different domains (Domain A,
B and C). Each domain groups the Diagnostic
Agents, a Blackboard Agent, a Conflict Resolution Agent
and a Facilitator Agent together. The Monitoring Agents
are not associated with one single domain. They are able to
provide measurements that may be used by the Diagnostic
Agents of different domains.

5  How to build a monitoring and diagnosis system

This chapter describes the steps that are required to build
a complete monitoring and diagnosis system by using
the results provided by the DIAMOND architecture.

5.1  Identify semantic structure of the industrial application

The first step in building a monitoring and diagnosis
system is to define all physical components of the
automated industrial system that have to be supervised,
together with their relations. There should be no overlap
between components, nor should there be “white spaces”,
i.e. parts of the system being diagnosed that are not
covered by any component.

The components of the industrial application may be
hierarchically or federally related to each other. If a
set of components forms a logical unit which
is largely self-contained, these components have to be
grouped together into a domain. The knowledge about the
components and domains is fixed for a specific industrial
application and will not change during runtime. DIAMOND
provides an ontology that defines the structure and the
possible attributes of any component. This knowledge is
stored in the structural knowledge base in XML format.

The  possible  relations  between  components  are  ex¬ 
pressed  by  the  attributes  of  each  “COMPONENT”  ele¬ 
ment: 


•  INPUT_CONNECTED_TO specifies a functional or
logical output of another component which affects
the behavior of this component.

•  EXCLUSIVE indicates that the faults of the two related
components are mutually exclusive.

•  BELONGING_TO  is  used  to  express  the  topological 
relation  between  “parent”  and  “child”  components. 

Further  attributes  are  the  certainty  of  the  identified  rela¬ 
tion  and  a  possible  time  delay  which  describes  the  tem¬ 
poral  behavior  of  related  components. 
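
A fragment of such a structural knowledge base, and a way to read it, might look as follows. Only the element name COMPONENT and the three relation attributes come from the text; the root element, the `NAME`, `CERTAINTY` and `TIME_DELAY` attribute names, the default values and the example components are assumptions. For simplicity the sketch applies one certainty and one time delay to all relations of a component.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout of a structural knowledge base.
STRUCTURE_XML = """
<STRUCTURE>
  <COMPONENT NAME="welding_cell"/>
  <COMPONENT NAME="controller" BELONGING_TO="welding_cell"/>
  <COMPONENT NAME="robot" BELONGING_TO="welding_cell"
             INPUT_CONNECTED_TO="controller"
             CERTAINTY="0.9" TIME_DELAY="2.0"/>
</STRUCTURE>
"""

RELATION_KINDS = ("INPUT_CONNECTED_TO", "EXCLUSIVE", "BELONGING_TO")


def load_relations(xml_text):
    """Return (component, kind, target, certainty, delay) tuples."""
    relations = []
    for comp in ET.fromstring(xml_text).iter("COMPONENT"):
        certainty = float(comp.get("CERTAINTY", "1.0"))
        delay = float(comp.get("TIME_DELAY", "0.0"))
        for kind in RELATION_KINDS:
            target = comp.get(kind)
            if target is not None:
                relations.append(
                    (comp.get("NAME"), kind, target, certainty, delay))
    return relations


relations = load_relations(STRUCTURE_XML)
```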

5.2  Identify  measurements 

It has to be investigated whether there are any existing
monitoring sources available which are able to access
measurable states of the plant. These sources may be
accessed by a Monitoring Agent. It also has to be
investigated which information about the measurable state of
the industrial application is accessible and how to obtain
it. If a legacy application is integrated for
accessing the system variables, the interfaces to this
application have to be identified. All measurable states
that are useful for a diagnosis have to be described as
measurements according to a well-defined ontology for
MEASUREMENT.

All measurements that will be used have to be associated
with a Monitoring Agent that is able to access them. It is
reasonable to associate all measurements that are
provided by a single data acquisition source, or that are
accessed by the same mechanism, with the same Monitoring
Agent.
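
The grouping rule in the last paragraph amounts to a simple partition of the measurements by acquisition source. A minimal sketch, with invented measurement names and with "odbc" and "dcom" used merely as example source labels:

```python
from collections import defaultdict


def assign_to_agents(measurements):
    """Group measurements so that each source gets one Monitoring Agent.

    measurements: iterable of (measurement name, acquisition source).
    Returns {source: [measurement names]}; one entry per Monitoring Agent.
    """
    agents = defaultdict(list)
    for name, source in measurements:
        agents[source].append(name)
    return dict(agents)


plan = assign_to_agents([
    ("arc_voltage", "odbc"),
    ("wire_feed_speed", "odbc"),
    ("positioner_angle", "dcom"),
])
```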

5.3  Identify  failure  modes 

The M&D system is able to identify those potential faults
of the industrial application that were specified in
advance. Therefore, it has to be investigated which legacy
diagnostic tools may be used and which faults these
tools are able to detect.

The attributes of each failure specify the conditions
that have to be fulfilled, the names of the rules that are
suitable for identifying the fault and potential recovery actions.
Another attribute identifies whether the occurrence of
this failure is related to another failure, either in the
same or in another component. This makes it possible to state
whether different faults contradict or complement
each other, how they are temporally related and
how a failure propagates.

All this information is stored within an XML structure.
These data are mainly processed by the Conflict
Resolution Agent to resolve diagnostic conflicts. The
diagnostic mechanisms and algorithms to identify the failure
are specific to the industrial process and will be used by
the Diagnostic Agents.

5.4  Build  Monitoring  Agents  and  connect 
with  industrial  environment 

To build the Monitoring Agents, DIAMOND provides a shell
that enables the creation of this type of agent.
The connection with the monitoring and diagnostic
infrastructure is realized automatically. The connection with
the sensors of the industrial application has to be made
afterwards. This task is specific to each application and
to each sensor.

There are several predefined configuration parameters
for a Monitoring Agent. These parameters may be set by
using a DIAMOND toolkit.

5.5  Build  Diagnostic  Agents  and  connect 
with  diagnostic  engines 

The measurements are used by the Diagnostic Agents to
perform a diagnosis. For this purpose, each Diagnostic
Agent needs to have a diagnostic engine. This may be a
commercial expert system or any other kind of diagnostic
engine that identifies failures of the related component. The
connection of the diagnostic engine with the M&D system
is realized by using a development shell for creating
the Diagnostic Agents. The interface to the diagnostic
engine has to be implemented afterwards individually.
Only two methods have to be implemented
for interfacing:

One method enables the diagnostic engine to get a
measurement value, which is read from an internal buffer
of the Diagnostic Agent. The engine does not need to care
where the measurement comes from. All measurements
that were identified in section 5.2 are accessible. After
the engine has performed its diagnosis, it provides the
diagnosis result to the Wrapper of the agent by using the
second method of the interface. The Wrapper makes the
result available to the infrastructure for further
processing.
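
The two-method interface can be illustrated with a short sketch. The method names `get_measurement` and `report_diagnosis`, the helper `store` and the toy `ThresholdEngine` are assumptions for illustration; the paper only states that one method reads a measurement from the agent's buffer and the other hands the result to the Wrapper.

```python
class DiagnosticAgentShell:
    """Stand-in for the agent side of the interface (names are assumptions)."""

    def __init__(self):
        self._buffer = {}   # latest value per measurement name
        self.outbox = []    # results handed over to the Wrapper

    def store(self, name, value):
        # Called when a measurement arrives from a Monitoring Agent.
        self._buffer[name] = value

    def get_measurement(self, name):
        # Method 1: the engine reads a value from the internal buffer.
        return self._buffer[name]

    def report_diagnosis(self, component, failure):
        # Method 2: the engine provides its result to the Wrapper.
        self.outbox.append((component, failure))


class ThresholdEngine:
    """Toy engine standing in for a legacy diagnostic tool."""

    def __init__(self, agent, limit=15.0):
        self.agent = agent
        self.limit = limit

    def diagnose(self):
        voltage = self.agent.get_measurement("arc_voltage")
        failure = "voltage_drop" if voltage < self.limit else "ok"
        self.agent.report_diagnosis("welding_torch", failure)
```

The engine in this sketch knows nothing about CORBA, event channels or other agents; it only sees the two methods, which is the property that makes wrapping a legacy engine possible.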

The diagnostic engine of a legacy diagnostic tool can be
integrated once this clear interface is implemented. No
further modifications are required, neither to the DIAMOND
framework nor to the diagnostic engine.

6  Evaluation  Examples 

The  functionality  of  the  presented  multi-agent  architec¬ 
ture  was  verified  by  integrating  a  specific  monitoring  and 
diagnosis  system  into  two  operational  prototypes. 

The first was concerned with the functional process of an
automated welding cell containing a control system, a
robot with gas-metal arc welding equipment and a
positioning device. To simulate faults that may occur in the
welding system, a simulator was used that emulates the
behavior of the welding equipment in different faulty
situations. The measurements were accessed by using an
ODBC interface and a DCOM interface. Several legacy
case-based reasoning engines, each responsible for a
different component, were applied to identify faulty
components. This integration demonstrated that different
data access methods and various diagnostic engines can
easily be integrated within one monitoring and diagnostic
system. This M&D system
was able to identify spatial conflicts and to recognize the
propagation of faults from one component to another.


The second evaluation example took an existing expert system for
diagnosing the water-steam cycle chemistry of a coal-fired power
plant (called SEQA, based on G2, Gensym) and re-worked it to
operate in a modern diagnostic framework. To verify the behavior
of the M&D system outside the power plant, a simulator based on a
neural network model was used either offline, to generate
artificial anomalies superimposed on normal patterns, or online,
to provide a set of normal behavior values against which
measurements are compared. The assimilation of a complete legacy
expert system into a distributed M&D framework proved to be a
complex task, since many interfaces were necessary for accessing
data and for using the legacy diagnostic engines.

7  Conclusion 

This paper describes a concept for building a distributed
architecture for monitoring and diagnosing a complex industrial
application. The presented M&D system uses a multi-agent approach,
which ensures flexibility, extensibility and cost-effective
development of the system.

One main extension to existing solutions is the possibility to
integrate legacy diagnostic tools into the overall diagnosis
system. This requires an extensive and exact specification of all
components, measurements and possible failures of the industrial
application, as well as a specification of their relations to each
other. This was realized by introducing a set of ontologies for
the monitoring and diagnostic system.

Furthermore, several diagnostic engines can be used in parallel.
They may refer to different components of the industrial
application and they may apply different diagnostic mechanisms. By
using different Diagnostic Agents, related to different
components, the diagnostic knowledge can be provided by the
component manufacturer. To support different diagnostic methods,
algorithms were developed that handle different diagnosis results
in parallel and investigate whether they complement or contradict
each other.
Using Supervised Learning Techniques for Diagnosis of Dynamic Systems

Abstract. This paper describes an approach based on supervised
learning techniques for the diagnosis of dynamic systems. The
methodology can start with real system data or with a model of the
dynamic system. In the second case, a set of simulations of the
system is required to obtain the necessary data. In both cases,
the data obtained are labelled according to the running conditions
of the system at the time the data were gathered. The label
indicates the running state of the system: correct operation or
abnormal functioning of some system component. After being
labelled, the data are processed to add further information about
the running of the system. The final goal is to obtain a set of
decision rules by applying a classification tool to the set of
labelled and processed data. In this way, any observation of the
system is classified according to those decision rules, which
return a label indicating the current running state of the system.
The returned label is the diagnosis. This entire learning task is
carried out off-line, before the diagnosis.

1  INTRODUCTION 

Diagnosis determines why a correctly designed system does not work
as expected. The explanation for this erroneous behaviour is a
discrepancy with the system design. One diagnosis task is to
determine which system elements could cause the erroneous
behaviour, according to the system observations. The monitoring
process is fundamental to avoid reporting spurious faults caused
by small alterations in variable values. A knowledge model for
monitoring dynamic systems is proposed in [1].

Fault detection consists in determining, from the system
observations, when the observed system is operating incorrectly.
When a failure is detected, diagnosis takes control to find the
reasons for that incorrect behaviour.

Fault detection and the diagnosis of faulty components are very
important from the strategic point of view of companies, because
of the economic demands and environmental conservation required to
remain in competitive markets. This is one of the reasons why this
is a very active research field. Component faults and process
faults can cause system damage and undesirable halts of the
system, which increases costs and decreases production. Therefore,
mechanisms to detect and diagnose system faults are needed to keep
systems at the required levels of security, production and
reliability.

1 Dpto de Ingeniería Electrónica, Sistemas Informáticos y Automática.
Universidad de Huelva. E-Mail: {abadhe, asuarez}@uhu.es

2 Dpto de Lenguajes y Sistemas Informáticos. Universidad de Sevilla.
E-Mail: {gasca, ortega}@lsi.us.es

Within the Artificial Intelligence community, the diagnosis of
dynamic systems has mostly been approached by adapting techniques
from static systems diagnosis to the dynamic behaviour of systems.
In this way, [2] and [3] try to add temporal information to GDE
[4].

On the other hand, qualitative models have also been commonly used
for this purpose [5] [6].

The fundamentals of model-based diagnosis applied to dynamic
systems are presented in [7], and more recently [8] proposes a
consistency-based approach with qualitative models.

Other techniques coming from AI have also entered the diagnosis
field. Along this line, learning techniques try to identify the
system behaviour on the basis of previous training.

Lately, some works using learning-based techniques have been
presented, such as stochastic methods [9], neural network based
learning [10] and classification systems [11]. Neural network
techniques have recently been applied in diverse fields, such as
medicine [12] or power supply [13].

Machine Learning techniques, within the supervised learning field,
are automated procedures based on logical operations that learn a
task from a set of examples. In the classification field,
attention has centred in particular on approaches with decision
trees [14], where classification is the result of a series of
logical steps. These approaches are able to represent very complex
problems if enough data are available. Applied to diagnosis, these
methods have been used for the classification of temporal patterns
[15] and in works preceding the current one [16] [17].

The present work is centred on quantitative models. It uses
supervised learning techniques to obtain a rule-based model that
diagnoses dynamic systems by recognizing the correct behaviour
models and the faulty behaviour models. An approach that offers
several fault causes, when there is no single clear cause, is
presented.

The rest of the document is organized as follows. The next section
presents the methodology and the way the diagnosis is carried out.
Then an application example of the developed approach is
described. To illustrate the operation of these techniques, a wide
set of tests is presented. Lastly, some improvements currently
under development are discussed.

2  PROPOSED  METHODOLOGY 

To carry out the diagnosis of dynamic systems, a set of decision
rules must be generated. This can be done starting from the known
trajectories of the system or from simulations generated from a
model.

Before presenting the methodology, some concepts need to be
defined.

2.1 Definitions and notation

Definition 1: Behaviour family. A finite group of trajectories
having a similar behaviour from the point of view of the
diagnosis.

Definition 2: Correct behaviour. The finite group of trajectories
belonging to evolutions of the system without any type of fault.

Definition 3: Perfect behaviour. The trajectory describing the
system when all parameters take the central values of the ranges
defined as correct.

Definition 4: Observation. A real trajectory of the dynamic system
containing values of the observable variables of the system.

Definition 5: Diagnosis. The identification of the observed
behaviour of the system as belonging to a certain behaviour family
(diagnosis label), according to the decision rules.

The proposed approach can start from two different situations:

• Rules are generated starting from a group of different
behaviour models.

    Model (behaviour) => labelled trajectories

• Rules are generated starting from a group of experimental
trajectories of the dynamic system for the correct behaviour and
the possible faulty behaviours.

    Trajectories (behaviour) => labelled trajectories

Starting from either of these situations, the process continues as
follows:

1. Similar trajectories belonging to different behaviour families
are identified. These trajectories are labelled again as belonging
to both behaviour families.

    Similar trajectories (different behaviour families) =>
    relabelled trajectories

2. Decision rules are generated using a supervised learning tool.

    Relabelled trajectories => decision rules

3. Diagnosis consists in associating an observation with a
behaviour family by using the decision rules.

    Classification (observation, rules) => diagnostic label

2.2  Methodology 

The proposed diagnosis methodology extends the one developed in
[16]. That basic methodology may present some problems when the
same system behaviour can be associated with different fault
reasons. To avoid diagnosing these cases incorrectly, in this new
approach such behaviours are associated with all the behaviour
families that can cause them. In this way, several fault causes
are offered for observations that can correspond to different
behaviour families.

The basic idea consists in obtaining a set of classification rules
from a set of system data in different behaviour modes: the
correct behaviour and the faulty behaviours. Afterwards, the
classification rules obtained can be used to associate an
observation with a behaviour model. Thus the diagnosis of the
observation is obtained.

The process can start with real system data or with a model of the
dynamic system. In the second case, a set of simulations of the
system is required to obtain the necessary data. In both cases,
the data obtained are labelled according to the running conditions
of the system at the time the data were gathered. The label
indicates the running state of the system: correct operation or
abnormal functioning of some system component. The final result is
a database containing all labelled trajectories.

Figure 1. Proposed Methodology

The resulting database contains very similar trajectories that
correspond to different behaviour families and therefore carry
different labels. To solve this problem, every set of similar
trajectories is relabelled with a new label, composed as a mix of
the older labels. Thus, relabelled trajectories are associated
with every one of their original behaviour families. The problem
is to define when two or more trajectories are similar. The
decision taken is that trajectories are similar when the distance
between them is lower than a given magnitude. That magnitude
should be specified for each treated system. The distance used is
the Euclidean distance.
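
A minimal sketch of this relabelling step is given here. It assumes that trajectories are equal-length lists of samples, that similarity is checked pairwise (without transitive merging), and that the mixed label is formed by joining the sorted original labels with "|". These choices, and the function names, are illustrative assumptions rather than the authors' implementation.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length trajectories."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def relabel(trajectories: List[Sequence[float]],
            labels: List[str],
            threshold: float) -> List[str]:
    """Give each trajectory the mix of labels of all similar trajectories."""
    merged = [{label} for label in labels]
    n = len(trajectories)
    for i in range(n):
        for j in range(i + 1, n):
            # Only trajectories with different labels need a mixed label.
            if labels[i] != labels[j] and \
                    euclidean(trajectories[i], trajectories[j]) < threshold:
                merged[i].add(labels[j])
                merged[j].add(labels[i])
    return ["|".join(sorted(label_set)) for label_set in merged]
```

For example, two trajectories labelled CcHigh and CmHigh that lie within the threshold of each other would both end up labelled "CcHigh|CmHigh".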

After being labelled and relabelled, the trajectory data are
processed to add further information about the running of the
system. This additional information is very useful when the
classification tool searches for decision rules, because more
information is available. It should characterize the system beyond
the gathered data, and it is specified for each treated system.

A new database, which contains the original trajectories plus the
new attributes and the corresponding label, is obtained.

The final step, to obtain the decision rules, is to apply a
classification tool to the labelled and processed database.

It should be highlighted that the whole process up to this point
is carried out off-line, so the time it takes does not affect the
diagnosis process.

The diagnosis process consists in evaluating an observation with
the decision rules obtained. The time spent on a diagnosis is only
the time needed to evaluate the decision rules. The decision rules
return the label associated with the behaviour, by correspondence
between the training data and the observed data. This returned
label is offered as the diagnosis.

Next, a case study is presented to illustrate this methodology.


Figure 2. The example system

3  CASE  STUDY 

As mentioned previously, the methodology can be used with real
system data or with data obtained from a model simulation. In our
case, the methodology is applied to a model, which is an idealized
situation but gives a clear idea of how to proceed. When applying
it to a real system, many difficult aspects not mentioned here
(such as monitoring or small phase shifts) need to be taken into
account; with the model we are only trying to present the
approach.

As an example of a dynamic system to diagnose we consider the
controlled electric motor in [18] and [19]. Figure 2 represents
the treated system. The motor ‘M’, whose rotational speed is ‘w’,
is driven through a voltage ‘v’ by the controller ‘C’, which acts
based on the desired speed ‘d’ and the speed ‘wm’ measured by the
revolution counter ‘S’. Controller ‘C’ is considered to be an
I-controller.


The system can be modelled by the following equations, which
include a constant for each component that is also used to model
the faulty behaviour of that component:

Motor: $T \frac{dw}{dt} = c_m v - w$ (1)

I-Controller: $\frac{dv}{dt} = c_c (d - w_m)$ (2)

Sensor: $w_m = c_s w$ (3)

where $T$ is the inertia of the motor, $c_m$ is the constant of
the motor, $c_c$ is the constant of the controller and $c_s$ is
the constant of the revolution counter.
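
The model can be simulated with a simple explicit Euler scheme, as sketched here. The integration method, the function name `simulate_wm`, the 100 s horizon and the initial voltage (taken equal to the initial speed, i.e. the steady state of a nominal motor) are assumptions made for illustration; the paper itself uses a dedicated simulation tool.

```python
from typing import List, Optional


def simulate_wm(cm: float = 1.0, cc: float = 1.0, cs: float = 1.0,
                T: float = 3.0, d: float = 10.0, w0: float = 5.0,
                v0: Optional[float] = None,
                dt: float = 0.1, t_end: float = 100.0) -> List[float]:
    """Return the measured speed wm at every time step (explicit Euler)."""
    w = w0
    # Assumption: the run starts with v = w0 unless v0 is given explicitly.
    v = w0 if v0 is None else v0
    steps = int(round(t_end / dt))
    wm = [cs * w]
    for _ in range(steps):
        dw = (cm * v - w) / T      # motor: T * dw/dt = cm * v - w
        dv = cc * (d - cs * w)     # I-controller fed by the sensor value wm
        w += dt * dw
        v += dt * dv
        wm.append(cs * w)          # sensor: wm = cs * w
    return wm
```

With the nominal constants the response overshoots the desired speed and then settles on it; with a faulty constant the measured speed still converges to the desired speed, but along a different trajectory.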

Anomalous operation of a component is caused mainly by a deviation
of the component constant from its nominal value: the constant
strays outside the range of values considered correct.

Some faults correspond to constants taking values above the
correct ones, and other faults to constants taking values below
the correct ones. The diagnosis result should indicate, in
addition to the faulty component, whether the values taken by the
component constant are below or above the correct values.

The possible fault reasons that we want to identify are therefore:
‘CmHigh’ when values of Cm are above the correct ones; ‘CmLow’
when values of Cm are below the correct ones; ‘CsHigh’ when values
of Cs are above the correct ones; ‘CsLow’ when values of Cs are
below the correct ones; ‘CcHigh’ when values of Cc are above the
correct ones; and ‘CcLow’ when values of Cc are below the correct
ones.

To describe the correct behaviour of the system, the constants are
not assumed to have a single correct value; rather, they can take
any value inside an interval that is considered correct.

This allows flexibility of operation and simulates the behaviour
of the real system better, since in practice there is no single
correct value but a tolerance margin. As a result, the system does
not have a single correct behaviour but a family of correct
behaviours, which represents all possible combinations of constant
values inside the defined tolerance limits.

A family of correct behaviours makes the diagnosis more difficult,
because different behaviours must be recognized as correct, but in
return it provides a more realistic view of the system.

In our model the constant values considered as correct are:

Table 1. Values for OK behaviours

Constant    Correct range
Cm          [0.98-1.02]
Cc          [0.98-1.02]
Cs          [0.98-1.02]

Other characteristics considered in our system are:

1. The fault is present from the beginning and does not evolve
over time.

2. A behaviour change occurs instantly and, from then on, the
behaviour does not change again.

3. Once the desired angular speed has been set, it does not change
until this angular speed is reached.


In this way, diagnosis is carried out when the desired angular
speed (d) is changed. The diagnosis is made by checking the
evolution of the system towards the final speed. It must be kept
in mind that, in spite of a failure in some component, the
I-controller is able to act on the motor so that the required
final speed is reached. Of course, the evolution of the system
towards the desired final speed will be different. This difference
in behaviour is what allows the diagnosis.


Figure 3. Forrester diagram of the system (Wm = Cs*W; V = INTEG(F2);
W = INTEG(F); F2 = Cc*(d - Wm); F = (Cm*V - W)/T)


The first step, therefore, is to perform simulations of the system
in its different behaviour modes. In our case, the system has been
modelled as a Forrester diagram [20] so that it can be simulated
with the simulation tool Vensim®. The Forrester diagram generated
for the system is presented in figure 3.

The simulated behaviours are those that we want to diagnose: OK
for the correct behaviour, and CmHigh, CmLow, CsHigh, CsLow,
CcHigh and CcLow for the component faults mentioned above.

Each of these behaviours is represented by a behaviour family.

The simulation values are shown in table 2.


Table 2. System values for simulation

Parameter    Value
T            3
D            10
W            5
Time Step    0.1

For the correct behaviour, the constant values lie in [0.98-1.02].
Values used to simulate behaviours above the correct one lie in
[1.02-5]. Values used to simulate behaviours below the correct one
lie in [0-0.98].

The constant values for the simulated behaviours have been chosen
at random by the Monte Carlo method, following a uniform
distribution. The number of simulations per behaviour is 100.

Each trajectory is tagged with the label of its behaviour. In this
way, a database containing 700 labelled trajectories is obtained.
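
The sampling of constants for the 700 simulations can be sketched as follows. The sketch assumes single faults, i.e. in a faulty behaviour only the named constant leaves the OK range while the other two are still drawn from it; the paper does not state this explicitly. The function names and the fixed seed are illustrative.

```python
import random
from typing import Dict, List, Tuple

OK_RANGE = (0.98, 1.02)
HIGH_RANGE = (1.02, 5.0)
LOW_RANGE = (0.0, 0.98)

BEHAVIOURS = ["OK", "CmHigh", "CmLow", "CsHigh", "CsLow", "CcHigh", "CcLow"]


def sample_constants(behaviour: str, rng: random.Random) -> Dict[str, float]:
    """Draw Cm, Cc and Cs uniformly for one simulation of a behaviour."""
    consts = {name: rng.uniform(*OK_RANGE) for name in ("Cm", "Cc", "Cs")}
    if behaviour != "OK":
        # "CmHigh" -> constant "Cm", direction "High"
        name, direction = behaviour[:2], behaviour[2:]
        fault_range = HIGH_RANGE if direction == "High" else LOW_RANGE
        consts[name] = rng.uniform(*fault_range)
    return consts


def build_parameter_sets(runs_per_behaviour: int = 100,
                         seed: int = 0) -> List[Tuple[str, Dict[str, float]]]:
    """Return (label, constants) pairs, one per simulation to be run."""
    rng = random.Random(seed)
    return [(b, sample_constants(b, rng))
            for b in BEHAVIOURS
            for _ in range(runs_per_behaviour)]
```

Each (label, constants) pair would then be passed to the simulator, and the resulting trajectory stored together with its label.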

Trajectories are composed of the values of the variable ‘wm’ at
each time step. The reason for selecting ‘wm’ and not ‘w’ is that
‘wm’ is the only observable variable in the real system.

Figures 4, 5 and 6 show different system behaviours.

The resulting database contains similar trajectories that belong
to different behaviours, so several very similar trajectories have
different labels. This is a problem, because our final goal is to
use a classification tool to obtain a set of decision rules, and
if similar trajectories carry different labels the classifier
cannot work correctly; that is to say, those similar trajectories
will be classified incorrectly. Figure 7 shows an example of this.

Figure 4. OK Behaviour

Figure 5. CmHigh Behaviour

Figure 6. CcLow Behaviour


To solve this problem, a new label is assigned to very similar
trajectories. The new label is composed of a mixture of the labels
of all the similar trajectories. The next step is therefore to
find all similar trajectories in the database and assign them a
new label.

It is necessary to define when two or more trajectories are
similar. Two trajectories are considered similar when the distance
between them is smaller than a given magnitude. The distance
between trajectories is measured as the Euclidean distance, and
the magnitude chosen is 10% of the Euclidean distance between the
two trajectories of the correct behaviour that are furthest apart.
In our example this magnitude is 0.45.
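
The threshold rule just described can be written down directly. In this sketch, `similarity_threshold` and its parameter names are illustrative; trajectories are assumed to be equal-length sequences of wm samples.

```python
import math
from itertools import combinations
from typing import Sequence


def similarity_threshold(ok_trajectories: Sequence[Sequence[float]],
                         fraction: float = 0.10) -> float:
    """Fraction of the largest pairwise distance among OK trajectories."""
    farthest = max(math.dist(a, b)
                   for a, b in combinations(ok_trajectories, 2))
    return fraction * farthest
```

Applied to the OK family of a given system, this yields the magnitude below which two trajectories with different labels are treated as similar.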


Figure 7. Behaviour CcHigh vs CmHigh (horizontal axis: time in seconds)

After this process we obtain a new database in which all similar
trajectories are re-labelled as corresponding to all the
behaviours of the trajectories they resemble.

The next step is to calculate new attributes for each trajectory,
so that the classifier has more information from which to generate
decision rules. These new attributes must be representative of
each trajectory.

For each trajectory point the following attributes are calculated:

• Distance to perfect behaviour. It indicates how far the current
trajectory is from the perfect behaviour (defined above). It is
calculated as:

$DP(i) = W_m[i] - W_{mpf}[i]$ (4)

where $W_m[i]$ is the treated point in the current trajectory and
$W_{mpf}[i]$ is the corresponding point in the perfect behaviour.

• Integral. It is the magnitude returned by numerical integration
between the current point and the preceding one. It represents the
area enclosed between them. It is approximated as follows:

$I(i) = T_s \times \frac{p[i] + p[i-1]}{2}$ (5)

where $T_s$ is the time step of the simulation, $p[i]$ is the
current treated point and $p[i-1]$ the preceding one.
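
Both per-point attributes can be computed in a few lines, as sketched here. The sketch assumes that the distance in (4) is the signed difference as written, that p[i] in (5) is the wm sample, and that the integral of the first point (which has no predecessor) is set to 0; the function name is illustrative.

```python
from typing import List, Sequence, Tuple


def point_attributes(wm: Sequence[float],
                     wm_perfect: Sequence[float],
                     ts: float) -> Tuple[List[float], List[float]]:
    """Return (DP, I) for every point of a trajectory."""
    # Equation (4): difference to the perfect-behaviour trajectory.
    dp = [a - b for a, b in zip(wm, wm_perfect)]
    # Equation (5): trapezoid between the current and the previous point.
    # Assumption: the first point has no predecessor, so its integral is 0.
    integral = [0.0] + [ts * (wm[i] + wm[i - 1]) / 2
                        for i in range(1, len(wm))]
    return dp, integral
```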

In addition, the following attributes are calculated for each
trajectory:

• Rise time (RT). The moment at which the desired revolution speed
is reached for the first time.

• Steady state (SS). The moment at which the desired revolution
speed is reached definitively.

• Max speed (MS). The highest revolution speed reached.

• Max speed time (MST). The moment at which the highest revolution
speed is reached.
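
The four trajectory attributes could be extracted as in the following sketch. It assumes a rising response (the trajectory starts below the desired speed) and treats the speed as "reached definitively" once every later sample stays within a tolerance `tol` of the desired speed; both the tolerance and the function name are assumptions, since the paper does not give these details.

```python
from typing import Dict, Optional, Sequence


def trajectory_attributes(wm: Sequence[float], d: float, ts: float,
                          tol: float = 0.01) -> Dict[str, Optional[float]]:
    """Compute RT, SS, MS and MST for one trajectory of wm samples."""
    # Rise time: first sample at or above the desired speed.
    rise_idx = next((i for i, x in enumerate(wm) if x >= d), None)
    # Steady state: earliest index from which all samples stay near d.
    steady_idx = None
    for i in range(len(wm) - 1, -1, -1):
        if abs(wm[i] - d) > tol:
            break
        steady_idx = i
    # Maximum speed and the first time it occurs.
    max_idx = max(range(len(wm)), key=lambda i: wm[i])
    return {
        "RT": None if rise_idx is None else rise_idx * ts,
        "SS": None if steady_idx is None else steady_idx * ts,
        "MS": wm[max_idx],
        "MST": max_idx * ts,
    }
```

Attributes that cannot be determined yet (for instance because the desired speed has not been reached) are returned as `None`.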

In this way, a new database containing the trajectories plus the
new attributes is generated.

Data in the new database have the following form:

RT, SS, MS, MST, Wm[1], DP[1], I[1], ..., Wm[n], DP[n], I[n], LABEL

The final step is to perform supervised learning on the resulting
database. The classification tool selected for the supervised
learning is C4.5 [21]. This tool characterizes each of the
behaviour families according to the values of the attributes that
have been provided. The result is a decision tree and an
equivalent set of decision rules. These rules are the means by
which the diagnosis is made. In our example the classifier obtains
27 rules with an error rate of 1.2%. This means that 1.2% of the
trajectories are not correctly classified by those rules.
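
C4.5 chooses its splits with the gain ratio criterion. The following sketch shows that criterion for a single binary threshold split on one numeric attribute; it illustrates the idea only and omits everything else C4.5 does (candidate threshold search, pruning, rule extraction). The function names are illustrative.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain_ratio(values: Sequence[float], labels: Sequence[str],
               threshold: float) -> float:
    """Gain ratio of splitting the examples at `value <= threshold`."""
    left: List[str] = [l for v, l in zip(values, labels) if v <= threshold]
    right: List[str] = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) \
        + (len(right) / n) * entropy(right)
    gain = entropy(labels) - remainder
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info
```

A threshold on an attribute such as MS that separates two behaviour families cleanly gets a higher gain ratio than one that leaves the families mixed.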

3.1  Diagnosis 

The diagnosis is made by evaluating the observed data with the
rules obtained.

Because the rules contain calculated attributes that do not appear
in the observed data, the same attributes must be calculated for
the observed data before they can be classified with those rules.

Thus, each time a new observed value is gathered, all attributes
that can be calculated at that moment are calculated. After that,
the decision rules are evaluated, with two possible results: a
label is returned, or the information is insufficient to evaluate
all the rules. In the first case the returned label is the result
of the diagnosis. In the second case we need to wait for more
information at later moments.
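
This evaluation, including the "insufficient information" outcome, can be sketched as follows. The rule representation (a list of threshold conditions plus a label) and the assumption that rules are mutually exclusive, as paths of a decision tree are, are choices made for this illustration; the thresholds in the usage example are made up.

```python
from typing import Dict, List, Optional, Tuple

Condition = Tuple[str, str, float]   # (attribute, "<=" or ">", threshold)
Rule = Tuple[List[Condition], str]   # (conditions, diagnosis label)


def _holds(condition: Condition,
           observed: Dict[str, float]) -> Optional[bool]:
    """True/False if the condition can be decided, None if data is missing."""
    attribute, op, threshold = condition
    if attribute not in observed:
        return None
    value = observed[attribute]
    return value <= threshold if op == "<=" else value > threshold


def evaluate_rules(rules: List[Rule],
                   observed: Dict[str, float]) -> Optional[str]:
    """Return the label of the satisfied rule, or None to wait for data."""
    for conditions, label in rules:
        if all(_holds(c, observed) is True for c in conditions):
            return label
    return None


# Hypothetical rules; the thresholds are invented for the example.
EXAMPLE_RULES: List[Rule] = [
    ([("MS", ">", 11.0)], "CsHigh"),
    ([("MS", "<=", 11.0), ("RT", "<=", 2.0)], "OK"),
]
```

Calling `evaluate_rules(EXAMPLE_RULES, {"RT": 1.5})` returns `None`, because MS is not yet known; once MS is available, a label is returned.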

If we want to diagnose the system under other running conditions,
a set of decision rules must have been prepared for those specific
conditions. For example, if we want to diagnose this system when
the current rotational speed is 12 rad/sec and the desired
rotational speed is 7 rad/sec, we must have generated a set of
decision rules for those conditions, and we use that set at
diagnosis time.

4  RESULTS  ON  THE  EXAMPLE  SYSTEM 

To evaluate the proposed methodology, a set of tests has been
carried out.

Observational data have been obtained by simulating the system
under specific conditions for each test. In this way a test
trajectory is obtained for which the correct diagnosis is known,
because it must be the one corresponding to the simulated
conditions.

The test conditions are the same as those mentioned above. They
are repeated in table 3:


Table 3. Tests conditions

Parameter          Value
T                  3
D                  10
W initial          5
Time Step          0.1
Values for OK      [0.98-1.02]
Values for HIGH    [1.02-5]
Values for LOW     [0-0.98]

In  table  4  we  can  see  results  for  the  tests: 


Table 4. Tests results

Columns: value of the constants (Cm, Cc, Cs); correct diagnosis;
diagnosis with simple labelling; diagnosis with re-labelling.

Cm     Cc     Cs     Correct    Simple labelling   Re-labelling
1      1      1.03   CS HIGH    CS HIGH            CS HIGH
1      1      1.07   CS HIGH    CS HIGH            CS HIGH
1      1      1.1    CS HIGH    CS HIGH            CS HIGH
1      1      1.5    CS HIGH    CS HIGH            CS HIGH
1      1      2      CS HIGH    CS HIGH            CS HIGH
1      1      3      CS HIGH    CS HIGH            CS HIGH
1      1.03   1      CC HIGH    OK                 OK
1      1.07   1      CC HIGH    CM HIGH            CC HIGH | CM HIGH
1      1.1    1      CC HIGH    CM HIGH            CC HIGH | CM HIGH
1      1.5    1      CC HIGH    CC HIGH            CC HIGH
1      2      1      CC HIGH    CC HIGH            CC HIGH
1      3      1      CC HIGH    CC HIGH            CC HIGH
1.03   1      1      CM HIGH    OK                 OK | CS LOW
1.07   1      1      CM HIGH    CM HIGH            CC HIGH | CM HIGH
1.1    1      1      CM HIGH    CM HIGH            CC HIGH | CM HIGH
1.5    1      1      CM HIGH    CM HIGH            CM HIGH
2      1      1      CM HIGH    CM HIGH            CM HIGH
3      1      1      CM HIGH    CM HIGH            CM HIGH
1      1      0.97   CS LOW     OK                 CS LOW | OK
1      1      0.93   CS LOW     CS LOW             CS LOW
1      1      0.89   CS LOW     CS LOW             CS LOW
1      1      0.85   CS LOW     CS LOW             CS LOW
1      1      0.5    CS LOW     CS LOW             CS LOW
1      1      0.1    CS LOW     CS LOW             CS LOW
1      0.97   1      CC LOW     OK                 OK
1      0.93   1      CC LOW     CC LOW             CC LOW | CM LOW
1      0.89   1      CC LOW     CC LOW             CC LOW | CM LOW
1      0.85   1      CC LOW     CC LOW             CC LOW | CM LOW
1      0.5    1      CC LOW     CC LOW             CC LOW
1      0.1    1      CC LOW     CC LOW             CC LOW
0.97   1      1      CM LOW     OK                 OK
0.93   1      1      CM LOW     CC LOW             CC LOW | CM LOW
0.89   1      1      CM LOW     CM LOW             CC LOW | CM LOW
0.85   1      1      CM LOW     CM LOW             CC LOW | CM LOW
0.5    1      1      CM LOW     CM LOW             CM LOW
0.1    1      1      CM LOW     CM LOW             CM LOW
0.99   0.98   1.02   OK         OK                 OK
1      1.02   1.02   OK         OK                 OK
0.98   1      0.98   OK         OK                 OK
0.98   1.02   1.02   OK         OK                 OK
0.99   1.01   1.01   OK         OK                 OK
1.01   1      0.99   OK         OK                 OK

We can see that the diagnosis methodology with simple labelling
does not offer a correct diagnosis in tests that are very near the
correct behaviour. In those cases the fault is not detected. At
other times the methodology returns an incorrect diagnosis, but in
general the results offered are acceptable.

This occurs because there are very similar trajectories belonging
to different behaviours, and the classifier cannot find rules that
differentiate them correctly.

To solve this problem, the new methodology proposes re-labelling
all similar trajectories, as mentioned above. The results obtained
show that the new methodology offers a multiple diagnosis when the
previous one cannot find the correct fault. In nearly all of these
tests the multiple diagnosis offered includes the correct one.

It is important to highlight that, in tests where the behaviour is
far from the correct one, the diagnosis offered is the correct
one.

In the set of presented fault tests (36 cases), the diagnosis is
correct in 58.33 % of the cases (21 tests). The correct diagnosis
is offered among others in 30.56 % of the cases (11 tests). An
incorrect diagnosis is offered in 2.78 % of the cases (1 test).
The fault is not detected in 8.33 % of the cases (3 tests).
Moreover, a failure is never reported when no failure exists.
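
The way each test outcome is categorised, and how the percentages follow from the counts, can be reproduced with a short sketch. The category names and the "|" separator for multiple diagnoses are conventions chosen for this illustration; the counts 21, 11, 1 and 3 are those read from the re-labelling column of Table 4 for the 36 faulty test cases.

```python
from typing import Dict


def outcome(correct: str, returned: str) -> str:
    """Categorise one test given the true fault and the returned label(s)."""
    labels = {part.strip() for part in returned.split("|")}
    if labels == {correct}:
        return "correct"
    if correct in labels:
        return "correct among others"
    if labels == {"OK"}:
        return "not detected"
    return "incorrect"


def percentages(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert outcome counts into percentages of all tests."""
    total = sum(counts.values())
    return {name: 100.0 * count / total for name, count in counts.items()}


# Counts for the 36 faulty test cases (re-labelling column of Table 4).
RELABELLED_COUNTS = {
    "correct": 21,
    "correct among others": 11,
    "incorrect": 1,
    "not detected": 3,
}
```

For instance, `outcome("CM HIGH", "OK | CS LOW")` is categorised as incorrect, while `outcome("CS LOW", "CS LOW | OK")` counts as correct among others.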

5 CONCLUSIONS AND FURTHER WORK

The presented methodology is able to perform the diagnosis of
dynamic systems and is independent of the system type. In fact,
one item of further work is to apply this methodology to a
non-linear dynamic system.

This capacity is due to the fact that the methodology is centred
only on the characteristics of the evolution of the system for the
correct behaviour and the faulty behaviours.

Another characteristic of the methodology is that the diagnosis
can be performed in a very simple way and requires very little
computation time.

Certain systems, such as the one presented in the example, can
produce similar behaviours for different fault reasons. This is
due to the relationships among the variables that govern the
system behaviour: the alteration of one variable can be
compensated by the alteration of another variable in the opposite
direction. To solve this problem, the methodology assigns multiple
fault reasons to system behaviours that could be produced by
different fault reasons. In this way a multiple diagnosis is
offered in those situations.

Another  further  work  is  to  be  able  to  diagnose  dynamic  system 
when  multiple  fault  occurs  at  the  same  time,  is  to  say,  identifying 
system  behaviours  when  more  than  one  component  is  faulty. 
Abstract

We investigate the problem of post-mortem fault isolation for
concurrent systems using a behavioral model. The aim is to isolate
the action that has caused the failure of the system, the root
action. The naive approach would be to say that a certain action is
the root action iff it is a logical consequence of the model and
observations that the action is the first “bad thing to happen”.
This, however, is a strong requirement and puts high demands on the
model. In this paper we describe the concept of strong root
candidate, a relaxation of the naive approach. The advantage of
determining the strong root candidate directly from model and
observations is that the set of traces consistent with model and
observations need not be explicitly computed. The property of strong
root candidate can instead be determined on-the-fly, thus only
computing relevant parts of the reachable state space.

1  Introduction 

In this paper we describe a model-based [Hamscher et al.,
1992] approach to fault isolation in object oriented control
software. The work is motivated by a real industrial robot
control system developed by ABB Robotics. The system is
large (on the order of $10^6$ lines of code), concurrent, has an
object oriented architecture and is highly configurable, supporting
different types of robots and cell configurations. Since the
system is time- and safety-critical, the first priority, in case of
a failure, is to bring the system to a safe state; alarms that go
off are logged and can be analyzed when the system comes
to a standstill. The faults considered are primarily hardware
faults, and therefore we rely on the assumption that the failing
hardware has some software counterpart that is affected by
the failure of the hardware. In addition we make the common
single fault assumption, i.e. that a system failure is caused by
only one fault (but resulting in cascading alarms).

The log thus contains partial information about the events
that took place at the approximate time of the system failure.
However, the order in which messages are logged does not
necessarily reflect the way error messages propagate - the
system is concurrent and safety critical actions may have to
be taken before error reporting takes place. Hence, in what
follows we (somewhat conservatively) view the log as a set of
error messages. In addition a system may contain a number of
critical events that are unobservable, but which may explain
all observable alarms.

The ultimate aim of our fault isolation method is to single
out the error message that explains the actual cause of the
failure, or possibly an unobservable critical event explaining the
observations. That is, we aim to discard error messages which
are definitely effects of other error messages, while trying to
isolate error messages (or critical events) which explain all
other messages. In contrast to message filtering, we can thus
find failing components that have not manifested themselves
in the error log, if the failing of the component is a logical
consequence of the model and the observations. Given the
size of the software it is not possible to use the code directly
- we have to rely on a model of the software. In this paper we
consider finite state machine models expressed in a process
algebra. The process algebra is chosen here because it allows
for more straightforward formal reasoning than, for example,
state charts, but the contribution of this work - the fault
isolation - relies only on the labeled transition system semantics
of the model. In practice, the aim is to use a behavioral model
that is an artifact of the software development process, such
as state charts. Then there is no extra cost associated with
maintaining a correct model when the software evolves, since
then so does the model.

In standard AI diagnosis literature, see e.g. [Reiter, 1987],
a diagnosis is a (minimal) set of failed components explaining
the observations. But for dynamic systems (systems with
state) a diagnosis is often defined as the set of all traces, or
trajectories, consistent with the observations (see e.g. [Cordier
et al., 2001; Console et al., 2000]). However, this definition is
generally insufficient to isolate the origin of the fault(s), and
requires post-processing to pinpoint e.g. the faulty
component(s). Our approach is more direct and focuses on finding
the alarm that explains (is consistent with) all observables:
given the system description, expressed in a simple process
algebra, and the observations, we try to infer the origin of the
fault using properties of actions involving the temporal
order, expressed in a specification language based on a subset
of the temporal logic CTL, originally developed for
verification [Clarke et al., 1999]. This resembles the process of
model checking, and as in the case of model checking there
is no need for calculation of the entire state space (obviously
equivalent to the set of traces consistent with model and
observations) if the temporal logic formulae are evaluated by
constructing the state space on-the-fly.

Our approach also bears some resemblance to that of
Sampath et al. [Sampath et al., 1995]. However, their work is
mainly concerned with diagnosability in discrete event
systems, i.e. to detect, within finite delay, whether a certain type
of fault has occurred. While our approach is amenable only to
post-mortem analysis, the work reported in [Sampath et al.,
1995] is mainly intended for monitoring and on-line detection
and diagnosis.

The rest of the paper is organized as follows: In Section
2 we describe the behavior language that will be used to
define a transition relation, which in turn defines the set of all
possible behaviors (i.e. traces). In Section 3 we provide rules for
entailment of some predicates of interest from configurations
and the traces that can follow from them. Finally, we outline
ongoing and future work in Section 4.

2  A  behavior  language 

A behavior model can be expressed in different ways, and we
have chosen to use a process algebra. No matter which
formalism and notation is used, the semantics should
provide a labeled transition relation that describes the state
transitions of the modeled system. In this section we describe a
process algebra influenced by CCS [Milner, 1989] and give
the necessary semantics.

2.1  Processes 

Our process language is constructed from the following
syntactic categories:

• a finite set $\mathcal{L}$ of action labels, denoted by $a$ in our meta
language. Every action label is equipped with an associated
arity $n \geq 0$.

• a set $O$ of object id's, denoted by $o$.

• a finite set $S$ of states $A$ with associated arity $n \geq 0$.

We consider four types of actions (denoted by $\alpha$ in our meta
language).

• Send actions of the form $o{:}a(t)$, where $o$ is the recipient
object, $a$ an $n$-ary action label and $t$ is an $n$-tuple of
object id's or variables.

• Receive actions of the form $a(x)$, where $a$ is an $n$-ary
action label and $x$ is an $n$-tuple of variables.

• Internal actions of the form $a$, where $a$ is a nullary action
label.

• New-actions of the form $new(o, P)$, where $o \in O$ and $P$
is a process expression, defined below.

A process is described by a process expression, denoted by
$P$ (and occasionally $Q$), and given by the following abstract
syntax

$$\mathcal{P} ::= A(t) \mid \sum_{i \in I} \alpha_i.P_i$$

where $I$ is a finite index set. Sums are usually written simply
$\alpha_1.P_1 + \alpha_2.P_2$. We reserve the nullary state Stop for a
completed process. We assume that every $A/n \in S$ (Stop
excepted) has a defining equation of the form

$$A(x) \stackrel{\mathrm{def}}{=} P$$

A process state $\sigma$ is a partial map from $O$ to $\mathcal{P}$. The object
$init \in O$ is called the initializing object, the state $Main \in S$
is called the main process and the state
$\sigma_0 := \{init \mapsto Main\}$ is called the initial process state.

Let $\sigma : O \to \mathcal{P}$ be a process state, $o \in O$ and $P \in \mathcal{P}$.
By $\sigma[o \mapsto P]$ we denote the process state which is almost
identical to $\sigma$ except possibly at $o$. That is

$$\sigma[o \mapsto P](x) := \begin{cases} P & \text{if } x = o \\ \sigma(x) & \text{otherwise} \end{cases}$$

The behaviors of our system are described by the labeled
transition rules in Figure 1. Our transitions are of the form

$$\sigma \xrightarrow{\alpha} \sigma'$$

where $\alpha$ (the observation) is a set of pairs of the form $(o, a)$
representing action $a$ occurring in object $o$.

There are four transition rules: sync, internal, new and def.
The rule sync allows two objects to synchronize their state
transitions and optionally exchange values. In our limited
algebra, the only values that can be transmitted are object
identifiers. However, the idea is not to model all system behavior,
but to have a system model that reveals synchronization and
system structure. The rule internal allows a single object to
perform a transition by itself. Creation of new objects is
handled by the rule new, and def allows for exchanging a state
with its definition.¹

Example 

Typically, a system is described by creating a main process
that sets up the system structure. Figure 2 shows an example
of such a system. Process Main creates three objects and
runs Setup, which tells the objects about each other via the
init call. This is needed since, when started, a process does
not know anything about its environment. After init, each
object will act as a peer-to-peer node, as shown in Figures 3
(the system) and 4 (object details). Objects can send requests
to each other, and sometimes the answer to a request is a
failure, and then the system is brought to a halt by transmission
of down messages.

3  Fault  Isolation 

The available information when doing fault isolation is a
system model and an observation (in our case a message log).
We use the term scenario to refer to that information. In the
following we overload the term action in the context of
scenarios to mean pairs $(o, a) \in O \times \mathcal{L}$ where $o$ is an object
identifier and $a$ is an action label. Some of the actions in a
system are critical actions, actions that are associated with
system failures.

Thus a scenario is a quadruple $(\to, Crit, Log_p, Log_n)$,
where $\to$ is a process state transition relation, $Crit \subseteq O \times \mathcal{L}$

¹ Since we rely on a finite state space model, we do not allow
unbounded creation of objects via the new rule.


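sync:

$$\frac{\sigma(o_i) = P_1 + o_j{:}a(t).P + P_2 \qquad \sigma(o_j) = P_3 + a(x).Q + P_4}{\sigma \xrightarrow{\{(o_i, a),\, (o_j, a)\}} \sigma[o_i \mapsto P][o_j \mapsto Q\{x/t\}]}$$

internal:

$$\frac{\sigma(o_i) = P_1 + a.P + P_2}{\sigma \xrightarrow{\{(o_i, a)\}} \sigma[o_i \mapsto P]}$$

new:

$$\frac{\sigma(o_i) = P_1 + new(o, Q).P + P_2 \qquad \sigma[o_i \mapsto P][o \mapsto Q] \xrightarrow{\alpha} \sigma'}{\sigma \xrightarrow{\alpha} \sigma'}$$

def:

$$\frac{\sigma(o_i) = A(t) \qquad A(x) \stackrel{\mathrm{def}}{=} P \qquad \sigma[o_i \mapsto P\{x/t\}] \xrightarrow{\alpha} \sigma'}{\sigma \xrightarrow{\alpha} \sigma'}$$

Figure 1: Process transition rules (t is a vector of object id’s)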


Servent(this, x, y) ≝ x:req(this).Wait(this, x, y) + y:req(this).Wait(this, x, y) +
                      req(o).Compute(this, x, y, o) + down().Down

Wait(this, x, y) ≝ ok().Servent(this, x, y) + fail().Fail(x, y)

Compute(this, x, y, o) ≝ o:ok().Servent(this, x, y) + o:fail().Servent(this, x, y)

Fail(x, y) ≝ x:down().Fail(x, y) + y:down().Fail(x, y)

Down ≝ Stop

S ≝ init(this, x, y).Servent(this, x, y)

Main ≝ new(s1, S).new(s2, S).new(s3, S).Setup(s1, s2, s3)

Setup(x, y, z) ≝ x:init(x, y, z).y:init(y, z, x).z:init(z, x, y).Stop

Figure 2: A process algebra example


is the set of critical actions, $Log_p \subseteq O \times \mathcal{L}$ is the set of
actions that have been observed (i.e. the message log), and
$Log_n \subseteq O \times \mathcal{L}$ is the set of actions known not to have
occurred (i.e. the observable actions not contained in the
message log). Thus, we assume that a synchronized action is
logged as two separate actions - one from the sending object
and one from the receiving. This allows modeling of
message sending with unknown receiver and is no severe
limitation, since it is possible to express receiver information by
having a model where the desired action labels are unique and
the receiver object id thus becomes unambiguous.

A configuration, denoted $C$, is the symbol $\bot$ or a pair $(\sigma, l)$
where $\sigma$ is a process state and $l \subseteq O \times \mathcal{L}$ is a set of actions.
The following rules define the configuration transition
relation $\Rightarrow$ for a given $\to$ and $Log_n$.

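$$\frac{\sigma \xrightarrow{\alpha} \sigma' \qquad \alpha \cap Log_n = \emptyset}{(\sigma, l) \Rightarrow (\sigma', l \cup \alpha)}$$

$$\frac{\sigma \xrightarrow{\alpha} \sigma' \qquad \alpha \cap Log_n \neq \emptyset}{(\sigma, l) \Rightarrow \bot}$$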
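
The two configuration rules can be read as a single step function. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the names `config_step` and `BOTTOM` are hypothetical, a process state is represented by an opaque string, and the underlying process transition (target state and observation) is assumed to be supplied by the caller.

```python
from typing import FrozenSet, Tuple, Union

Action = Tuple[str, str]          # (object id, action label)
BOTTOM = "BOTTOM"                 # hypothetical marker for the forbidden configuration
Config = Union[str, Tuple[str, FrozenSet[Action]]]


def config_step(seen: FrozenSet[Action],
                sigma_next: str,
                alpha: FrozenSet[Action],
                log_n: FrozenSet[Action]) -> Config:
    """One configuration step for a process transition labelled alpha.

    seen       -- the set l of actions accumulated so far
    sigma_next -- the process state reached by the transition
    alpha      -- the observation (set of actions) of the transition
    log_n      -- actions known not to have occurred
    """
    if alpha & log_n:
        # The step performs an action that the log rules out.
        return BOTTOM
    # Otherwise the step is consistent: record the observed actions.
    return (sigma_next, seen | alpha)
```

A step whose observation intersects the negative log leads to the forbidden configuration; any other step simply accumulates its observation in the set of seen actions.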

The configuration $(\{init \mapsto Main\}, \emptyset)$ is called the initial
configuration. The configuration $\bot$ is called a forbidden
configuration and represents configurations that are allowed by
the behavioral model, but inconsistent with the observations
at hand. We see configurations as snapshots of the system
state of a given scenario, and the configuration transition
relation describes the behavior of the system. Fault isolation is
the process of finding the first critical action that has occurred
in a given scenario, the root action. Given the single fault
assumption and a system model that is properly designed, the
first critical action to occur in the system is the cause of the
failure.

An action a is present in a scenario if the system model
and the observation entail the occurrence of a. An action a
is an enabled root if the assumption that a is the root action is
consistent with the observations and the system model. We
introduce the concept of strong root candidate, and say that a
strong root candidate is an action that is both present and an
enabled root.

3.1  Predicate  rules 

Given a certain scenario $(\to, Crit, Log_p, Log_n)$, we wish to
reason about properties of reachable configurations.
Therefore we define predicates that correspond to the interesting
properties by determining for which configurations they hold
true. Since we are interested in strong root candidates, we
need to formally define present actions and enabled root
actions. Thus we define the predicate present(a), which holds
in configurations where action a must occur sometime in the
future, and the predicate enabledroot(a), which holds for
configurations where it is consistent to assume that a may be the
first critical action to occur. In defining these two predicates,
we will need some helper predicates. We will use okend, which
holds in configurations that correspond to consistent ending
states of the system. An ending state is a state where no more
observable actions occur, i.e. when the system has reached a
final state. In a configuration where a has occurred, seen(a)
holds, while nocrit holds in configurations where no critical
action has occurred. The predicate end holds in
configurations where there is no next configuration.

We define entailment of logical formulae from the
following syntax:

$$F ::= F \lor F \mid F \land F \mid \lnot F \mid EF(F) \mid EX(F) \mid AG(F) \mid end \mid okend \mid nocrit \mid seen(a) \mid present(a) \mid enabledroot(a)$$

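In order to be able to define entailment for the desired
predicates, we will need the following. We use $\Rightarrow^*$ for the reflexive
transitive closure of $\Rightarrow$. First we define entailment for basic
connectives.

$$\frac{C \models F_1 \qquad C \models F_2}{C \models F_1 \land F_2} \qquad \frac{C \models F_2}{C \models F_1 \lor F_2}$$

$$\frac{C \models F_1}{C \models F_1 \lor F_2} \qquad \frac{C \not\models F}{C \models \lnot F}$$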

We  will  be  reasoning  about  temporal  order,  so  we  need  to 
define  temporal  logic  operators. 

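$$\frac{C \Rightarrow^* C' \qquad C' \models F}{C \models EF(F)}$$

$$\frac{C \Rightarrow C' \qquad C' \models F}{C \models EX(F)}$$

$$\frac{C' \models F \text{ whenever } C \Rightarrow^* C'}{C \models AG(F)}$$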
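
To make the three operators concrete, here is a small Python sketch that evaluates them over an explicitly given successor function. It is only an illustration under stated assumptions: the names `reachable`, `EX`, `EF`, `AG` and the toy graph `GRAPH` are hypothetical, and configurations are stood in for by integers. Because `reachable` is a generator, `EF` and `AG` stop exploring as soon as the answer is known, which mirrors the on-the-fly evaluation mentioned in the introduction.

```python
from typing import Callable, Dict, Hashable, Iterable, Iterator, List

Succ = Callable[[Hashable], Iterable[Hashable]]
Pred = Callable[[Hashable], bool]


def reachable(c: Hashable, succ: Succ) -> Iterator[Hashable]:
    """Lazily enumerate every C' with C =>* C' (including C itself)."""
    seen = {c}
    stack = [c]
    while stack:
        cur = stack.pop()
        yield cur
        for nxt in succ(cur):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)


def EX(c: Hashable, succ: Succ, pred: Pred) -> bool:
    """Some direct successor satisfies pred."""
    return any(pred(n) for n in succ(c))


def EF(c: Hashable, succ: Succ, pred: Pred) -> bool:
    """Some reachable configuration satisfies pred."""
    return any(pred(n) for n in reachable(c, succ))


def AG(c: Hashable, succ: Succ, pred: Pred) -> bool:
    """Every reachable configuration satisfies pred."""
    return all(pred(n) for n in reachable(c, succ))


# Hypothetical toy transition graph: 0 -> 1 -> 3 and 0 -> 2.
GRAPH: Dict[int, List[int]] = {0: [1, 2], 1: [3], 2: [], 3: []}


def toy_succ(c: Hashable) -> Iterable[Hashable]:
    return GRAPH.get(c, [])
```

In the toy graph, `EF(0, toy_succ, lambda c: c == 3)` holds while `EX(0, toy_succ, lambda c: c == 3)` does not, since node 3 is two steps away from node 0.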

We also need entailment for a few helper predicates. The
predicate end determines if a configuration lacks a successor
(i.e. $end = \lnot EX(true)$ where true is entailed by every
configuration), seen(a) is true when an action a has occurred
and nocrit holds when no critical actions have yet occurred.

$$\frac{\lnot \exists C',\ C \Rightarrow C'}{C \models end} \qquad \frac{a \in l}{(\sigma, l) \models seen(a)} \qquad \frac{\forall a \in l,\ a \notin Crit}{(\sigma, l) \models nocrit}$$

Now we have the tools needed to define the desired
predicates. If we have reached a configuration from which the
system cannot continue to execute and all actions in $Log_p$ are
seen, then the configuration is an okend, unless the
configuration is a forbidden configuration. It is thus one of the possible
halting configurations, given the scenario at hand.

$$\frac{\forall a \in Log_p,\ C \models seen(a) \qquad C \models end \qquad C \neq \bot}{C \models okend}$$

If it is true for all reachable configurations that whenever
we have reached an okend, we have seen action a, we
conclude that the presence of a is entailed from observations and
system model.

$$\frac{C \models AG(\lnot okend \lor seen(a))}{C \models present(a)}$$

If there is a reachable configuration $C_1$ such that no critical
actions have taken place, and there is a configuration step that
takes us from $C_1$ to $C_2$ where the critical action a has
occurred, we conclude that a is an enabled root if it is possible
to reach an okend from $C_2$.

$$\frac{a \in Crit \qquad C \models EF(nocrit \land EX(seen(a) \land EF(okend)))}{C \models enabledroot(a)}$$


3.2  Reasoning  about  behavior 

Given a scenario, the strong root candidates are the actions a
for which

$$(\{init \mapsto Main\}, \emptyset) \models present(a) \land enabledroot(a)$$

If we have no strong root candidates or more than one strong
root candidate, the system model is not strong enough for
efficient fault isolation. If, on the other hand, we have exactly
one strong root candidate, we assume that we have pinpointed
the true cause of the fault. This is reasonable to assume, since
the action found is the only one that is known to have occurred
(its presence is entailed by the scenario) and it is consistent
with the given scenario to assume that the action is a root
event.

Of course there is still a possibility that there are other
enabled root events whose presence is consistent with the
scenario, but assuming one of them to be root would demand an
explanation of why the strong root candidate (proven to be
present!) is not the root.
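
The definitions above can be exercised on a small explicit-state example. The sketch below is not the XSB prototype described in the next subsection; it is a hypothetical Python illustration in which the process transition relation is given directly as a dictionary (`TOY_LTS`) instead of being derived from process equations, and the class name `Scenario` and all state names are assumptions. It enumerates the whole reachable set rather than building it on-the-fly.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Action = Tuple[str, str]                          # (object id, action label)
Config = Optional[Tuple[str, FrozenSet[Action]]]  # None stands for the forbidden configuration
LTS = Dict[str, List[Tuple[FrozenSet[Action], str]]]


class Scenario:
    """Explicit-state sketch of a scenario (->, Crit, Log_p, Log_n)."""

    def __init__(self, lts: LTS, init: str, crit, log_p, log_n):
        self.lts = lts
        self.init: Config = (init, frozenset())
        self.crit = frozenset(crit)
        self.log_p = frozenset(log_p)
        self.log_n = frozenset(log_n)

    def successors(self, c: Config) -> List[Config]:
        if c is None:                    # the forbidden configuration has no successors
            return []
        sigma, seen = c
        return [None if alpha & self.log_n else (nxt, seen | alpha)
                for alpha, nxt in self.lts.get(sigma, [])]

    def reachable(self, start: Config) -> Set[Config]:
        found, stack = {start}, [start]
        while stack:
            for nxt in self.successors(stack.pop()):
                if nxt not in found:
                    found.add(nxt)
                    stack.append(nxt)
        return found

    def okend(self, c: Config) -> bool:
        # No successor, not forbidden, and every logged action has been seen.
        return c is not None and not self.successors(c) and self.log_p <= c[1]

    def present(self, a: Action) -> bool:
        # AG(not okend or seen(a)), evaluated from the initial configuration.
        return all(a in c[1] for c in self.reachable(self.init) if self.okend(c))

    def enabledroot(self, a: Action) -> bool:
        # a in Crit and EF(nocrit and EX(seen(a) and EF(okend))).
        if a not in self.crit:
            return False
        for c1 in self.reachable(self.init):
            if c1 is None or c1[1] & self.crit:
                continue                 # nocrit does not hold in c1
            for c2 in self.successors(c1):
                if c2 is not None and a in c2[1] and any(
                        self.okend(c3) for c3 in self.reachable(c2)):
                    return True
        return False

    def strong_root_candidates(self) -> List[Action]:
        return sorted(a for a in self.crit
                      if self.present(a) and self.enabledroot(a))


def act(obj: str, label: str) -> FrozenSet[Action]:
    return frozenset({(obj, label)})


# Hypothetical toy system with two runs: o1 fails first, or o3 fails and then o1.
TOY_LTS: LTS = {
    "s0": [(act("o1", "fail"), "s1"), (act("o3", "fail"), "s3")],
    "s1": [(act("o2", "down"), "s2")],
    "s3": [(act("o1", "fail"), "s4")],
    "s4": [(act("o2", "down"), "s5")],
}
TOY = Scenario(TOY_LTS, "s0",
               crit={("o1", "fail"), ("o3", "fail"), ("o2", "down")},
               log_p={("o2", "down")}, log_n=set())
print(TOY.strong_root_candidates())   # prints [('o1', 'fail')]
```

In this toy scenario (o3, fail) is an enabled root but is not present, because one consistent run ends without it, while (o1, fail) occurs in every consistent ending and can be the first critical action, so it is the only strong root candidate.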

3.3  Prototype  implementation 

We have designed a prototype XSB [Sagonas et al., 1994]
program that takes a system model and observations as input
and enumerates the strong root candidates. XSB is a Prolog
dialect using tabling (memoization) to improve
termination. Given the system model in Figure 2 and facts stating
that  any  sending  of  fail  or  down  indicates  system  failure, 
i.e.  those  actions  are  critical  actions,  and  the  observations 
that  ( 02,  fail )  has  not  occurred  and  ( o2,fail )  has  occurred, 
the  XSB  Prolog  program  computed  ( Oi,fail )  to  be  the  single 
strong  root  candidate. 

The system consists of three objects that all execute the
same process. See Figure 4 for an automaton representation
of a similar process (parameters are not explicit in the
automaton). Consider the critical actions. Obviously, no down
message can be root action since it will always be preceded
by a fail action, and neither can (o2, fail) be root action since
it is known not to have occurred at all. This leaves us with
(o1, fail) and (o3, fail). It is consistent with the system model
and the observations to assume that (o2, fail) is the root
action, since if o2 receives the fail from o3, then o1 can send
fail to o3 afterwards. We cannot prove that (o3, fail) has
happened, however. This can be done for (o1, fail), and therefore
it is the only action that is both enabled root and present.

Thus, having some intuition of the system makes the fault
isolation described above almost trivial, but the key
motivation of this work is to formalize and automate this intuition.

4  Future  Work 

In previous work with Larsson [Larsson et al., 2000; Larsson,
1999] we studied the fault isolation problem using a structural
model. A key feature of that approach is the use of software
engineering models, in particular UML [Object Management
Group, 1999] class diagrams. Such a model can be
developed and maintained at a relatively low cost, being an
integrated part of the software development process. The work
presented here and in our previous work [Lawesson, 2000;
Lawesson et al., 2001] aims to strengthen the diagnostic
capability while still using standard and state-of-the-art modeling
notations. Behavior in UML is often expressed using
statecharts, and process algebras provide a textual representation
of state machines. Of course, forcing the software
developer to construct complete statecharts for all classes is not
realistic in large software systems; hence, reasoning must be
able to cope with incomplete or missing behavioral
descriptions. Our approach should also be extended to deal with the
special features characteristic of object oriented software
systems such as classes and inheritance. Below we sketch some
partial solutions to such issues, which will be addressed in
our future work.

4.1 Classes, behaviors and inheritance

Our  process  algebra  expresses  a  system  model  as  a  flat  set 
of  the  process  defining  equations  without  any  hierarchy.  In 
an  object  oriented  design,  the  system  behavior  is  partitioned 
into  classes.  Furthermore,  inheritance  allows  for  a  hierarchy 
of  classes.  We  implement  simple  schemas  called  classes  in 
order  to  achieve  the  partitioning  and  (inheritance)  hierarchy. 

Thus, in the following a class is a scheme that can be
compiled to a set of process defining equations. A class C may
inherit parts of its characteristics (e.g. its behavior) from a
superclass, and in that context C is referred to as the subclass.
A state inheritance sequence

$$S \mapsto [A_1, A_2, \ldots, A_n]$$

is a declaration saying that state $S$ in the superclass is refined
by states $A_1, A_2, \ldots, A_n$ in the subclass where $A_1$ is the
default state (i.e. the substate entered when entering the
superstate $S$). When compiling the class to process equations, the
inheritance sequence describes how the defining equations
from the superclass should be used. Thus, we implement a
simple form of inheritance as refinement. The syntax used
for defining classes below is

N = (S, I), D

where N is the name of the class, S is the name of the
superclass (if any), I is the set of state inheritance sequences and D
is a set of process defining equations. If there is no superclass
we write N = (), D.

Example 

Lacking formal tools, we outline the approach by an example.
In the following we define two classes C1 and C2, where C2
refines the state A in C1 with states C and D. We say that
states C and D refine state A.

C1 = (), {
    A ≝ b.B
    B ≝ a.A
}

C2 = (C1, {A ↦ [C, D]}), {
    C ≝ d.D
    D ≝ c.C
    B ≝ e.D
}


Now, C2 may be compiled to the following process
equations.

C2:C ≝ b.C2:B + d.C2:D

C2:B ≝ a.C2:C + e.C2:D

C2:D ≝ b.C2:B + c.C2:C

The outgoing transitions from A become outgoing
transitions from all refining states, while the incoming transitions
are moved from the refined state to the first of the refining
states. If there are transitions from the same state in both
super- and subclass, they are joined as a nondeterministic choice,
as with state B and transitions a.A and e.D. The states are
prefixed with the class name to avoid name space clashes.
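
The compilation step described above can be sketched in a few lines of Python. This is an illustrative reading of the example only, not a tool from the paper: the function `compile_subclass` and the dictionary encoding of equations are hypothetical, states are assumed to be parameterless, and only a single level of inheritance is handled.

```python
from typing import Dict, List, Tuple

Equations = Dict[str, List[Tuple[str, str]]]   # state -> [(action label, target state)]


def compile_subclass(name: str,
                     super_eqs: Equations,
                     refinements: Dict[str, List[str]],
                     sub_eqs: Equations) -> Equations:
    """Flatten a subclass into process equations (parameterless states only)."""

    def redirect(target: str) -> str:
        # Incoming transitions of a refined state go to its default (first) substate.
        return refinements[target][0] if target in refinements else target

    # States of the compiled class: unrefined superclass states,
    # refining states, and states defined in the subclass.
    states: List[str] = [s for s in super_eqs if s not in refinements]
    for group in list(refinements.values()) + [list(sub_eqs)]:
        for s in group:
            if s not in states:
                states.append(s)

    compiled: Equations = {}
    for s in states:
        branches: List[Tuple[str, str]] = []
        for refined, refining in refinements.items():
            if s in refining:
                # Outgoing transitions of the refined state are copied
                # to every refining state.
                branches += super_eqs.get(refined, [])
        if s not in refinements:
            branches += super_eqs.get(s, [])   # transitions inherited unchanged
        branches += sub_eqs.get(s, [])          # joined as a nondeterministic choice
        compiled[f"{name}:{s}"] = [(label, f"{name}:{redirect(t)}")
                                   for label, t in branches]
    return compiled


# The classes C1 and C2 from the example, in the hypothetical encoding.
C1: Equations = {"A": [("b", "B")], "B": [("a", "A")]}
C2_OWN: Equations = {"C": [("d", "D")], "D": [("c", "C")], "B": [("e", "D")]}
C2 = compile_subclass("C2", C1, {"A": ["C", "D"]}, C2_OWN)
```

Here `C2["C2:B"]` contains the inherited transition a, redirected from A to the default substate C, next to the subclass transition e.D, which corresponds to the joined choice described for state B.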

4.2  Statecharts 

Since both processes and statecharts have a transition system
semantics, the mapping is straightforward once the semantics
of the statecharts is fixed. We use a handshaking semantics
of the statecharts, because of expressivity and domain
properties as described in [Lawesson, 2000]. We define the
semantics via our process language by providing a mapping from
statecharts to processes. The mapping is rather
straightforward since we restrict ourselves to statecharts without history
states - essentially making the statechart equivalent to an
automaton without hierarchy, see for example [Lilius and
Porres, 1999]. The process algebra example in Figure 2 could
represent a slightly improved version² of the automaton in
Figure 4 with structure information (i.e. the states S, Main and
Setup) added.

4.3  Default  behaviors  of  class  diagrams 

Since a class diagram in general does not contain behavioral
information in terms of statecharts, we may introduce a
superclass called Propagator that encapsulates the behavior of
being able to propagate errors as well as reporting errors to
the log, and a subclass Breakable that is a propagator that can
introduce errors by the transition crit. The idea is to let all
classes inherit from Propagator, and then refine with
behavioral models when available, and use Breakable for classes
that may give rise to critical actions but where a behavioral
model is missing. The definitions of Propagator and
Breakable are given in Figure 5.

The paths of error propagation between classes are
computed by using information about dependencies between
classes in the class diagrams (as in [Larsson, 1999; Larsson
et al., 2000]), and then reflected in the Failed(x) state that
models error propagation.
Figure 3: A global picture of the example system consisting of the objects o1, o2 and o3. Each object has behavior as described
in Figure 4.




Figure 4: An automaton describing a peer-to-peer system. Sending actions are suffixed with ! and the rest of the actions are
receiving actions. There are no internal actions in this automaton.


Propagator_n = (), {
    Main ≝ init(x).OK(x)
    OK(x) ≝ fail().Failing(x)
    Failing(x) ≝ log.Failed(x) + nolog.Failed(x)
    Failed(x1, x2, ..., xn) ≝ x1:fail().Failed(x) + ... + xn:fail().Failed(x)
}


Breakable = (Propagator, {}), {
    OK(x) ≝ crit.Failing(x)
}


Figure  5:  Definitions  of  the  classes  Propagator  and  Breakable 
Abstract

Networked embedded systems are composed of a large number
of physically distributed nodes that interact with the physical
world via a set of sensors and actuators, have their own
computational capabilities, and communicate with each other
via a wired or wireless network. Monitoring and diagnosis
for such systems must address several challenges caused by
the distribution of resources, communication limitations, and
node and link failures. This paper presents a distributed
diagnosis framework that exploits the topology of a physical
system to be diagnosed to limit inter-diagnoser communication
and compute diagnoses in an anytime and any-information
manner, making it robust to communication and processor
failures. The framework adopts the consistency-based
diagnosis formalism and develops a distributed constraint
satisfaction realization of the diagnosis algorithm. Each local
diagnoser first computes locally consistent diagnoses, taking
into account local sensing information only. The local
diagnosis sets are reduced to globally consistent diagnoses
through pairwise communications between local diagnosers.
The algorithm has been successfully demonstrated for the
diagnosis of paper path faults for the Xerox DC265 printer.

Introduction 

Our diagnostic research is motivated by existing and emerging
applications of networked, embedded systems. In such
systems the physical plant is composed of a large number
of distributed nodes, each of which performs a moderate
amount of computation, collaborates with other nodes via
a wired or wireless network, and is embedded in the physical
world via a set of sensors and actuators. Examples
include distributed sensor networks (Chu, Haussecker, &
Zhao 2001), complex electromechanical systems with embedded
controllers (Zhao et al. 2001), data networks, smart
matter systems (Jackson et al. 2001), and ad-hoc wireless
networks of consumer devices. Such systems present a number
of interesting new challenges for diagnostic systems. A
moderate amount of computation is potentially available, but
it is partitioned into embedded chunks that range in size
from tiny, in the case of smart dust sensor motes (Kahn,
Katz, & Pister 1999), to moderate, in the case of consumer
devices. Communication between nodes is available, but may
involve unreliable delivery, power-constrained wireless networks,
or large, complex topologies requiring multiple hops
to connect two arbitrary nodes. Finally, nodes might leave
the network dynamically and nodes of a previously unseen
type might join in their place.

In  this  paper,  we  consider  how  we  might  apply  techniques 
from  model-based  diagnosis  to  these  types  of  problems.  In 
general,  traditional  model-based  techniques  are  centralized. 
They  assume  that  the  diagnostic  algorithm  is  run  on  a  sin¬ 
gle  processing  unit  that  has  access  to  observations  from  all 
sensors  in  the  physical  plant.  In  the  next  two  sections  of 
the  paper,  we  briefly  discuss  centralized,  model-based  tech¬ 
niques  and  discuss  how  they  cause  scalability,  robustness 
and  reconfigurability  problems  if  employed  directly  on  net¬ 
worked,  embedded  systems.  We  then  present  a  set  of  use¬ 
ful  properties  for  diagnostic  algorithms  for  such  systems. 
In  the  fourth  section,  we  present  a  simple  formulation  for 
diagnosis  of  discrete,  distributed  systems  in  order  to  mo¬ 
tivate  discussion  and  map  the  formulation  onto  distributed 
constraint  satisfaction  and  distributed  constraint  optimiza¬ 
tion.  We  next  propose  an  algorithmic  framework  for  dis¬ 
tributed  diagnosis  that  operates  in  an  anytime  manner  and  is 
robust  to  communication  and  processor  failures.  We  dis¬ 
cuss  the  communications  requirements  for  the  framework 
and  compare  performance  results  for  one  instantiation  of  the 
distributed  diagnosis  framework  against  a  centralized  diag¬ 
noser.  In  the  related  work  section,  we  discuss  why  exist¬ 
ing  distributed  constraint  satisfaction  and  optimization  algo¬ 
rithms  are  not  well  suited  for  distributed  diagnosis  of  net¬ 
worked,  embedded  systems.  We  finally  discuss  two  open 
areas  for  future  work.  The  contributions  of  this  paper  are 
that  it  illustrates  the  interesting  features  of  networked,  em¬ 
bedded  systems  that  make  them  challenging  for  traditional 
model-based diagnosis techniques, it presents a simple
formulation of the distributed diagnosis problem for this type
of system and relates it to distributed constraint satisfaction
and  optimization,  it  presents  a  class  of  robust,  anytime  al¬ 
gorithms  for  performing  diagnosis,  and  it  illustrates  prelim¬ 
inary  diagnostic  results  on  a  model  of  a  real  physical  system 
with  comparisons  to  an  existing  centralized  diagnoser. 

Model-based  Diagnosis 

The  objective  of  diagnosis  is  to  determine  the  state  of  a  phys¬ 
ical  plant  such  as  a  printer,  aircraft  or  network,  based  upon 
the  current  sensor  readings  from  the  plant  and  prior  knowl¬ 
edge  about  the  plant’s  structure  and  behavior.  In  order  for 
the diagnosis to be useful for on-line control of the plant,


accurate  diagnoses  must  be  generated  in  a  time-critical  man¬ 
ner  using  the  available  computational  resources.  In  most 
model-based  diagnostic  techniques,  prior  knowledge  about 
the  physical  plant  consists  of  a  description  of  the  behav¬ 
ior  of  each  component  of  the  plant,  including  normal  and 
faulty  behaviors,  and  the  interconnections  between  compo¬ 
nents  (Hamscher,  Console,  &  de  Kleer  1992).  Partial  ob¬ 
servability  presents  the  main  challenge  of  diagnosis.  Faults 
in  a  component  may  not  be  directly  observable,  and  in¬ 
stead  may  cause  changes  in  the  behavior  of  the  plant  that 
propagate  through  several  components  before  becoming  ob¬ 
servable  at  a  sensor.  To  perform  diagnosis,  the  component 
models  are  combined  into  a  global  store,  observations  are 
obtained  from  the  physical  plant,  and  a  centralized  algo¬ 
rithm  is  applied  to  find  a  system-wide  diagnosis.  We  be¬ 
lieve  this  very  abstract  description  captures  many  diagnostic 
formalisms, including logic-based formalisms such as those
based  upon  (de  Kleer  &  Williams  1989)  or  (Reiter  1987), 
bond  graphs  (Mosterman  &  Biswas  1997)  and  many  others. 
Throughout  this  paper  we  will  use  a  formalism  and  exam¬ 
ples  consistent  with  GDE  (de  Kleer  &  Williams  1987)  and 
its  descendants,  keeping  in  mind  the  general  properties  of 
centralized,  model-based  diagnosis  that  are  at  issue. 

Figure  1  on  the  next  page  schematically  illustrates  a  small 
model  for  the  kind  of  traditional  problem  we  might  attack 
with  a  model-based  diagnoser.  The  24  boxes  represent 
rollers,  gears,  motors,  sensors  and  other  devices  in  a  printer 
paper  path.  For  example,  the  acRoll  acquires  a  sheet  of  pa¬ 
per  from  the  paper  tray  and  transports  it  to  the  feedRoll, 
driven  by  the  acBelt.  We  have  developed  a  simple  diagnos¬ 
tic  application  for  this  paper  path  system  using  L2  (Kurien  & 
Nayak  2000),  a  centralized,  GDE-style  diagnoser  developed 
by NASA. Each component is modeled by a finite state
machine augmented with finite domain variables that describe
its  behavior.  Arcs  between  components  in  Figure  1  repre¬ 
sent  interactions  between  components,  for  example  convey¬ 
ing  that  the  acRoll  receives  an  angular  velocity  from  the  ac¬ 
Belt.  This  is  represented  by  a  constraint  between  the  cor¬ 
responding  variables.  There  are  five  sensors  that  report  the 
time  of  arrival  of  a  sheet  of  paper  at  various  points  in  the 
paper  path. 

To  perform  diagnosis  with  L2  and  this  model,  observa¬ 
tions  as  to  when  or  if  the  paper  arrived  at  various  points  in 
the  path  would  first  be  obtained  from  the  printer’s  sensors 
via  its  internal  data  bus  and  sent  to  an  external  processor 
running  L2.  The  values  would  be  discretized  and  assigned 
to  the  corresponding  variables  in  the  constraint  system.  A 
constraint  optimization  algorithm  would  be  applied  to  the 
updated  constraint  system  to  find  assignments  to  the  vari¬ 
ables  that  are  consistent  with  the  observations.  Such  an  as¬ 
signment  might  represent  that  the  paper  was  late  at  the  first 
sensor  because  the  feedMotor  is  slow,  slowing  down  both 
the  acRoll  and  the  feedRoll.  This  information  could  then 
be  used  to  perform  maintenance,  or  in  systems  with  redun¬ 
dancy,  to  reconfigure  the  system  for  robust  control.  In  ad¬ 
dition  to  this  small  demonstration,  we  have  applied  similar 
diagnostic techniques to spacecraft (Bernard et al. 1998),
chemical  processing  plants  (Goodrich  &  Kurien  2001),  sci¬ 
entific  instruments,  and  other  electromechanical  systems  to 


Given  a  set  of  component  models  and  a  centralized  diagnoser  C: 

1.  C  combines  the  component  models  in  a  central  store 

2.  Observations  are  collected  from  the  physical  system 

3.  C  computes  the  system-wide  diagnoses 

Figure  2:  Centralized  Diagnosis  of  a  Centralized  System 


Given  a  set  S  of  currently  connected  components  and  a  central¬ 
ized  diagnoser  C: 

1. $\forall s \in S$, $s$ forwards its component model to C

2. C combines the component models in a central store

3. $\forall s \in S$, $s$ forwards its observations to C

4. C computes the system-wide diagnoses

5. $\forall s \in S$, C projects the variables of interest to $s$ from the
diagnoses and forwards them to $s$

Figure  3:  Centralized  Diagnosis  of  a  Networked  System 


assist  in  robust  control. 

Challenges  of  Monitoring  and  Diagnosing 
Networked,  Embedded  Systems 

Suppose  we  would  like  to  perform  diagnosis  for  a  recon- 
figurable,  networked,  embedded  system.  Such  systems  are 
constructed  such  that  each  component  is  locally  controlled 
by  a  small,  embedded  processor  which  coordinates  with 
other  processors  via  a  potentially  unreliable  network.  In  ad¬ 
dition,  components  and  their  processors  might  be  unplugged 
and  replaced  with  upgraded  versions  from  time  to  time.  Ex¬ 
amples  of  such  systems  are  ad-hoc  wireless  networks,  modu¬ 
lar  robots,  and  more  conventional  systems  such  as  intranets. 
Even  traditional  electro-mechanical  systems  such  as  printers 
and  automobiles  now  contain  on-board  networks,  embedded 
sensing  and  tens  or  hundreds  of  local  controllers. 

We  can  provide  diagnostic  information  to  the  local  con¬ 
trollers  of  such  a  system  using  centralized  diagnosis  via  the 
process  outlined  in  Figure  3.  First,  a  centralized,  global  di¬ 
agnosis  problem  is  created  by  assembling  a  global  model  of 
the  components  within  a  centralized  diagnoser.  The  obser¬ 
vations  are  centrally  collected  and  a  diagnosis  or  set  of  diag¬ 
noses  are  computed  by  the  centralized  diagnoser.  Aspects  of 
the centralized, global diagnosis are then distributed back
to  the  local  controllers. 

This  approach  makes  several  assumptions.  First,  there 
must  exist  a  processor  large  enough  to  store  the  global  diag¬ 
nostic  model  and  run  the  centralized  diagnostic  algorithm. 
If  this  processor  fails,  it  must  be  acceptable  for  no  further 
diagnoses  to  be  generated.  Second,  there  must  exist  a  cen¬ 
tral  bus  or  buses  with  sufficient  capacity  to  forward  all  data 
needed  for  diagnosis  to  the  central  processor.  If  a  bus  fails, 
the data needed to diagnose and recover from the failure must
be  located  on  the  near  side  of  the  bus  with  respect  to  the 
diagnostic  processor,  or  it  must  be  acceptable  for  no  further 
diagnoses  to  be  generated  for  the  bus  and  the  far  side  compo- 
Figure 1: Paper Path Model in Xerox DC265ST Printer


nents.  Finally,  the  set  of  components  to  be  diagnosed  must 
be  represented  using  the  same  formalism,  and  in  most  appli¬ 
cations  must  be  known  a  priori. 

With  networked,  embedded  systems,  all  of  these  assump¬ 
tions  may  be  false.  Each  processor  in  the  plant  may  be  quite 
small.  If  a  processor  fails,  we  may  require  the  components 
attached  to  remaining  processors  to  continue  operating  in 
a  full  diagnosis  and  control  cycle.  If  the  network  is  bifur¬ 
cated,  we  may  require  that  each  half  of  the  plant  continues 
operations  to  the  extent  possible  and  works  to  resolve  the 
failure  with  the  locally  available  information.  New  compo¬ 
nents  might  join  into  the  network  at  any  time  by  publishing 
their  capabilities  such  as  described  by  JINI  (Sun  Microsys¬ 
tems  Inc  1999). 

These  issues  suggest  an  approach  wherein  we  do  not  arti¬ 
ficially  centralize  the  problem  but  allow  a  local  diagnoser  to 
be  associated  with  each  system  processor.  Each  local  diag¬ 
noser  finds  a  partial  diagnostic  solution  using  a  model  of  the 
locally  controlled  portion  of  the  plant  and  the  locally  avail¬ 
able  observations.  Communication  is  then  required  to  re¬ 
fine  the  partial  diagnostic  solution  into  a  diagnosis,  in  effect 
making use of observations and models local to other
diagnosers. We next suggest themes for dividing and coordinating
the diagnostic process to maximize scalability, robustness
and reconfigurability, based upon our experience with
both diagnosis and networked, embedded systems.

•  Scalability 

Dividing the diagnostic problem among local diagnosers
allows  us  to  apply  multiple  processors  and  potentially  ad¬ 
dress  computational  scalability  problems  caused  by  the 
small  processors  we  may  encounter  in  some  embedded 
systems.  To  address  communication  scalability  issues, 
we  seek  to  exploit  the  topology  of  the  physical  plant. 
We  would  like  to  arrange  that  two  local  diagnosers  need 


communicate  only  if  the  subsystems  of  the  physical  plant 
they  correspond  to  are  physically  interconnected  or  share 
data.  Thus  the  structure  of  our  diagnostic  architecture 
will  mimic  the  physical  topology  of  the  plant  being  di¬ 
agnosed.  For  the  type  of  engineered  systems  that  are  typ¬ 
ically  amenable  to  diagnosis,  physical  scalability  is  ac¬ 
complished  by  modularizing  subsystems  and  connecting 
them  through  fairly  narrow  physical  interfaces  (power, 
data,  physical  support).  By  respecting  these  interfaces, 
we  expect  our  communication  needs  for  moving  diagnos¬ 
tic  data  to  scale  as  well  as  the  underlying  physical  plant. 

•  Robustness 

A  diagnostic  architecture  must  be  extremely  robust  to  fail¬ 
ure  and  able  to  operate  in  an  anytime  and  any  information 
manner.  This  can  be  accomplished  with  refinement.  We 
would  like  to  arrange  that  each  diagnoser  locally  produce 
a  superset  of  the  diagnoses  that  a  global  diagnoser  would 
produce  for  the  local  components.  Communication  with 
other  diagnosers  is  then  used  only  to  prune  the  local  diag¬ 
nosis  set.  This  yields  several  important  properties.  First, 
the  diagnostic  process  can  be  interrupted  at  any  time  and 
each  diagnoser  will  contain  the  true  diagnosis  plus  possi¬ 
ble  imposters.  This  is  an  important  safety  feature  in  do¬ 
mains  where  taking  action  based  upon  a  false  negative  can 
cause  serious  harm.  Second,  if  diagnosers  fail,  then  the 
remaining  diagnosers  will  simply  produce  coarser  (more 
conservative)  estimates  of  the  possible  states  of  their  com¬ 
ponents.  Third,  if  the  system  is  bifurcated  due  to  a  com¬ 
munication  failure,  then  each  half  will  produce  all  diag¬ 
noses  consistent  with  the  reachable  diagnosers  and  any 
state  of  the  other  half  of  the  system. 

•  Reconfigurability 

A  side  effect  of  employing  local  diagnosers  that  commu- 


Figure 4: Automaton Representing A Single Valve


Figure  5:  Variable  Connectivity  In  a  Global  Model 


nicate  via  opaque  interfaces  defined  by  the  physical  plant 
is  natural  support  for  modular  or  reconfigurable  plants. 
Intuitively,  a  connected  subset  of  the  components  of  Fig¬ 
ure  1  may  be  disconnected  from  the  plant  and  replaced  by 
new  hardware  with  a  different  model,  so  long  as  the  phys¬ 
ical  and  diagnostic  interface  at  the  point  of  disconnection 
is  maintained.  In  addition,  this  opens  the  possibility  of 
participation  by  different  implementations  of  the  same  di¬ 
agnostic  algorithm  or  even  different  algorithms  participat¬ 
ing  in  a  diagnosis.  The  latter  would  of  course  require  an 
interface  that  is  semantically  meaningful  for  all  partici¬ 
pating  diagnosers.  However,  even  the  former  capability 
might  be  useful  in  allowing  vendors  of  components  that 
are  likely  to  be  connected  (e.g.  data  network  components 
or  power  distribution  components)  to  create  diagnosers 
that  can  collaborate. 

We  believe  these  properties  will  be  of  interest  as  we  begin  to 
investigate  applications  involving  very  large  numbers  of  em¬ 
bedded  processors  communicating  via  networks.  In  the  next 
section  we  introduce  a  simple  formalization  that  will  allow 
us to discuss algorithmic directions for this type of problem.

Centralized  Formulation 

Our  approach  to  distributed  qualitative  diagnosis  follows  the 
centralized  diagnostic  formalism  developed  in  (de  Kleer  & 
Williams  1989)  and  extended  in  (Williams  &  Nayak  1996) 
and  (Kurien  &  Nayak  2000).  To  motivate  our  distributed 
algorithms,  we  begin  with  a  brief  overview  of  the  central¬ 
ized  technique,  summarized  from  (Kurien  &  Nayak  2000). 
Suppose  we  would  like  to  diagnose  the  state  of  a  single  com¬ 
ponent,  a  valve,  which  is  qualitatively  modeled  via  the  finite 
state machine illustrated in Figure 4. We refer to each
possible discrete state of a component as a mode. A valve $v$ has
three modes: open, closed, and stuckClosed. The flow through
the valve, $\mathit{flow}_v$, has the discrete domain
{zero, nonzero}, and its behavior within each mode can be
captured with the following propositional formulae.


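\[
\begin{array}{lcl}
v = \mathit{open} & \Rightarrow & \mathit{flow}_v = \mathit{nonzero} \\
v = \mathit{closed} & \Rightarrow & \mathit{flow}_v = \mathit{zero} \\
v = \mathit{stuckClosed} & \Rightarrow & \mathit{flow}_v = \mathit{zero}
\end{array}
\]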

If $\mathit{flow}_v$ is observable from the physical plant, we will refer
to this variable as an observation. In order to represent the
non-determinism of the automaton within a propositional
framework, the encoding introduces an assumption variable
$a$. Intuitively, $a_v$ represents the choice that Nature makes
as  to  whether  valve  v  will  behave  normally  or  experience  a 


failure  when  it  is  commanded.  The  transition  portion  of  the 
automaton  can  thus  be  captured  by  the  following  formulae. 


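\[
\begin{array}{l}
a_v = \mathit{normal} \Rightarrow
\left\{
\begin{array}{lcl}
v_t = \mathit{closed} \wedge \mathit{cmd}_t = \mathit{open} & \Rightarrow & v_{t+1} = \mathit{open} \\
v_t = \mathit{closed} \wedge \mathit{cmd}_t \neq \mathit{open} & \Rightarrow & v_{t+1} = \mathit{closed} \\
v_t = \mathit{open} \wedge \mathit{cmd}_t = \mathit{close} & \Rightarrow & v_{t+1} = \mathit{closed} \\
v_t = \mathit{open} \wedge \mathit{cmd}_t \neq \mathit{close} & \Rightarrow & v_{t+1} = \mathit{open} \\
v_t = \mathit{stuckClosed} & \Rightarrow & v_{t+1} = \mathit{stuckClosed}
\end{array}
\right. \\[1ex]
a_v = \mathit{stick} \Rightarrow v_{t+1} = \mathit{stuckClosed}
\end{array}
\]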


Intuitively, the diagnostic task is to find a set of assignments
to the assumptions, here $\{a_v\}$, such that the model is
consistent with the observations, here $\{\mathit{flow}_v\}$. For example,
suppose $v_t = \mathit{closed}$ and we command the valve open, represented
by $\mathit{cmd}_t = \mathit{open}$. The plant assigns $O$ as $\mathit{flow}_v = \mathit{zero}$.
The only consistent assignment to $a_v$ is $a_v = \mathit{stick}$, and we
diagnose that the valve is stuck closed. If we wish to model multiple
automata,  we  introduce  a  mode  and  assumption  for  each  au¬ 
tomaton  and  compile  all  automata  into  a  set  of  formulae  that 
may  share  variables.  For  example,  two  valves  in  series  share 
the  same  flow.  Figure  5  visualizes  the  compilation  of  the  de¬ 
vice  constraints  into  a  global  constraint  system  model.  Each 
node  represents  a  finite  domain  variable.  Two  nodes  are  con¬ 
nected  by  an  edge  if  the  two  variables  appear  in  a  constraint 
together,  denoting  that  the  possible  values  of  the  variables 
are  related  by  interacting  together  in  some  physical  process 
or  the  transmission  of  data.  Note  that  a  realistic  model  such 
as  that  of  Figure  5  contains  many  observations  and  assump¬ 
tions,  and  many  assignments  may  be  consistent.  More  for¬ 
mally,  let  A  denote  the  set  of  assumptions,  O  denote  the  set 
of  observations,  and  F  denote  the  formulae  describing  the 
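plant. Given an assignment $\Omega$ to $O$ created by observing
the plant, a diagnosis $D$ is an assignment to $A$ such that the
following propositional formula is consistent, where $d_i$ is the
value $D$ assigns to $a_i$ and $\omega_j$ is the value $\Omega$ assigns to $o_j$:

\[ \bigwedge_{a_i \in A} (a_i = d_i) \;\wedge\; \bigwedge_{o_j \in O} (o_j = \omega_j) \;\wedge\; F. \]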

To perform diagnosis over multiple components, we must
find an assignment to each $a$ that renders the set of formulae
consistent with all observations. Intuitively, we assign the
observations reported by the physical plant, $\Omega$, to the
variables of the graph corresponding to observations, $O$, then
reassign the assumption variables, $A$, until the constraint
system illustrated in Figure 5 becomes consistent. Thus in this
diagnosis framework, diagnosis can be viewed as a constraint
satisfaction problem.
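
The single-valve example can be sketched as a brute-force consistency check in Python. This is only an illustration of the formulation, not the L2 implementation; the function names `next_mode`, `flow` and `diagnose` are hypothetical, and the observed flow is assumed to be measured after the command takes effect.

```python
ASSUMPTIONS = ("normal", "stick")

def next_mode(mode, cmd, assumption):
    """Transition formulae of the valve automaton (illustrative encoding)."""
    if assumption == "stick" or mode == "stuckClosed":
        return "stuckClosed"
    if mode == "closed":
        return "open" if cmd == "open" else "closed"
    # Remaining case: mode == "open"
    return "closed" if cmd == "close" else "open"

def flow(mode):
    """Behavior formulae: flow is nonzero only in the open mode."""
    return "nonzero" if mode == "open" else "zero"

def diagnose(mode, cmd, observed_flow):
    """Return every assumption value consistent with the observation."""
    return [a for a in ASSUMPTIONS
            if flow(next_mode(mode, cmd, a)) == observed_flow]

stuck_case = diagnose("closed", "open", "zero")
ambiguous_case = diagnose("open", "close", "zero")
```

In the first call only the stick assumption survives, matching the example in the text. In the second call both assumptions remain consistent, which shows why a realistic model may admit many diagnoses.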

A  second  diagnostic  task  is  to  find  the  most  likely  diag¬ 
noses.  For  each  assumption  assignment  we  can  associate 
the prior probability of the event the assumption represents.


Figure  6:  Partition  Among  Three  Diagnosers 


Thus, $P(a_v = \mathit{stick})$ denotes the prior probability of the valve
sticking. Assuming conditional independence, the probability
of a diagnosis is defined as follows.

\[ P(D) = \prod_{(a_i = d_i) \in D} P(a_i = d_i) \]

Given  multiple  components,  we  must  find  the  assignment  to 
each  a  that  renders  the  set  of  formulae  consistent  with  all 
observations  such  that  the  probability  of  the  assignment  is 
maximal. Intuitively, we assign the observations reported
by the physical plant, $\Omega$, to the variables of the graph
corresponding to observations, $O$, then choose among the
possible reassignments of assumption values to assumption
variables, $A$, until the constraint system illustrated in Figure 5
becomes  consistent.  The  choice  of  which  assumption  to  re¬ 
assign  and  to  which  value  to  assign  it  is  based  upon  the  prob¬ 
ability  of  the  possible  assignments.  In  this  case,  diagnosis 
can  be  viewed  as  a  constraint  optimization  problem. 
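
The optimization view can be sketched by enumerating the consistent diagnoses and ranking them by prior probability. The following Python sketch assumes two valves in series that both start closed and are commanded open; the prior values and the names `PRIORS`, `series_flow` and `ranked_diagnoses` are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import prod

# Hypothetical priors for each valve's assumption variable.
PRIORS = {
    "v1": {"normal": 0.99, "stick": 0.01},
    "v2": {"normal": 0.95, "stick": 0.05},
}

def series_flow(assignment):
    """Both valves start closed and are commanded open.

    The shared flow is nonzero only if neither valve sticks.
    """
    all_open = all(a == "normal" for a in assignment.values())
    return "nonzero" if all_open else "zero"

def ranked_diagnoses(observed_flow):
    """Return (probability, diagnosis) pairs, most likely first."""
    names = sorted(PRIORS)
    consistent = []
    for values in product(*(PRIORS[n] for n in names)):
        d = dict(zip(names, values))
        if series_flow(d) == observed_flow:
            p = prod(PRIORS[n][d[n]] for n in names)
            consistent.append((p, d))
    return sorted(consistent, key=lambda pd: pd[0], reverse=True)

ranking = ranked_diagnoses("zero")
```

With these assumed priors, three diagnoses are consistent with zero flow, and the most likely one blames only the second valve. A practical diagnoser would search in order of probability rather than enumerate every assignment.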

Distributed  Diagnosis 

In  this  paper,  we  propose  splitting  the  global  diagnostic  pro¬ 
cess  into  a  number  of  cooperating  local  diagnostic  processes. 
In  order  to  distribute  the  problem,  we  divide  the  global  di- 
agnoser  which  produces  assignments  to  A  into  a  set  of  local 
diagnosers which make assignments to a subset of $A$.
Intuitively, we partition the edges of Figure 5. If a node is
connected to edges in more than one partition, it is replicated
and the partitions must reach consensus on its value. More
formally, a local diagnoser $L$ is described by
$(F_L, V_L, A_L, O_L, R_L)$, where $F_L$ is the subset of $F$ assigned
to $L$, $V_L$ denotes the set of variables that appear in $F_L$,
$A_L$ denotes $A \cap V_L$, $O_L$ denotes $O \cap V_L$, and $R_L$ denotes
the union of $V_L \cap V_M$ over all other diagnosers $M$. Figure 6
illustrates a possible partitioning of the constraint graph of
Figure 5. The slightly darker nodes indicate the members of $R_L$,
shared variables that have been replicated. Given a fixed number
of diagnosers or the maximum number of constraints a diagnostic
processor can accommodate, we can use a graph partitioning
algorithm (Sanchis 1989) to find a partitioning of the graph
that attempts to minimize the size of $R_L$ for each diagnoser.
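
The tuple describing each local diagnoser can be derived mechanically from a partition of the constraints. The Python sketch below assumes each constraint is represented only by its scope (the set of variables it mentions); the function name `local_views` and the variable names in the example are hypothetical.

```python
def local_views(partition, assumptions, observations):
    """Compute V_L, A_L, O_L and R_L for every local diagnoser.

    partition maps a diagnoser name to a list of constraint scopes,
    where each scope is a set of variable names.
    """
    V = {L: set().union(*scopes) for L, scopes in partition.items()}
    views = {}
    for L in partition:
        shared = set()
        for M in partition:
            if M != L:
                shared |= V[L] & V[M]   # union of V_L ∩ V_M over other M
        views[L] = {
            "V": V[L],
            "A": assumptions & V[L],
            "O": observations & V[L],
            "R": shared,
        }
    return views

# Hypothetical two-valve model: the flow variable "f" is shared.
views = local_views(
    {"L": [{"a1", "v1", "f"}], "M": [{"a2", "v2", "f"}, {"f", "sensor"}]},
    assumptions={"a1", "a2"},
    observations={"sensor"},
)
```

Here only `f` is replicated, so the two diagnosers need to reach consensus on a single variable.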

Our  approach  to  finding  consistent  diagnoses  in  a  dis¬ 
tributed  fashion  is  refinement  based.  Intuitively,  each  local 
diagnoser  finds  the  diagnoses  for  the  locally  modeled  com¬ 
ponent  that  are  consistent  with  the  constraints  of  the  local 
model  and  the  local  observations.  This  is  a  superset  of  the 
diagnoses for the local components that are consistent with
all  constraints  and  observations.  Each  local  diagnoser  then 


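1. Given observation set $\Omega$, if $o_j \in O_L$, assign $o_j = \omega_j$ in $L$.

2. $\forall L$, if $O_L \neq \emptyset$, compute all assignments to $A_L \cup R_L$ s.t.

$\bigwedge_{o_j \in O_L} (o_j = \omega_j) \wedge \bigwedge_{a_i \in A_L} (a_i = d_i) \wedge \bigwedge_{r_i \in R_L} (r_i = \rho_i) \models F_L$

3. For each $r \in R_L$, for each other diagnoser $M$, if $r \in V_M$ send
all $R_L$ assignments to $M$.

4. In each such $M$, compute all assignments such that

$\bigwedge_{r_i \in R_L} (r_i = \rho_i) \wedge \bigwedge_{a_k \in A_M} (a_k = d_k) \wedge \bigwedge_{r_k \in R_M} (r_k = \rho_k) \models F_M$

5. If the consistent $R_M$ assignments decreased in step 4, return
to step 3, substituting $M$ for $L$.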

Figure  7:  Consistency-based,  Anytime  Diagnosis 


communicates directly with other diagnosers to further
reduce the set of consistent diagnoses for the local
components. We would like the diagnoses to start with a superset
of  the  globally  consistent  diagnoses  and  move  toward  only 
the  globally  consistent  diagnoses.  We  define  the  relation¬ 
ships  conservative  and  feasible  between  the  diagnoses 
produced  by  a  global  diagnoser  and  the  diagnoses  produced 
by a local diagnoser. A local diagnosis set $D_L$ is
conservative with respect to the global diagnosis set $D_G$ if

\[ \forall \delta_G \in D_G, \; \Pi_{A_L}(\delta_G) \in D_L \]

where $\Pi$ is the projection operator. That is, the assignments
made to the assumptions local to $L$ by a global diagnosis
must also be made by a local diagnosis. A local diagnosis
set $D_L$ is feasible if the assignments made to the local
assumptions are contained in a consistent global diagnosis.
More formally,

\[ \forall \delta_L \in D_L \; \exists \delta_G \in D_G \,.\, \Pi_{A_L}(\delta_G) = \delta_L. \]
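
The two relationships can be checked directly when the diagnosis sets are small enough to enumerate. The Python sketch below assumes diagnoses are dictionaries from assumption names to values; the helper names `project`, `is_conservative` and `is_feasible` are hypothetical.

```python
def project(diagnosis, variables):
    """Projection of a diagnosis onto a set of assumption variables."""
    return frozenset((v, diagnosis[v]) for v in variables)

def is_conservative(local_set, global_set, local_assumptions):
    """Every global diagnosis, projected onto A_L, appears in D_L."""
    local = {project(d, local_assumptions) for d in local_set}
    return all(project(g, local_assumptions) in local for g in global_set)

def is_feasible(local_set, global_set, local_assumptions):
    """Every local diagnosis is the projection of some global diagnosis."""
    projected = {project(g, local_assumptions) for g in global_set}
    return all(project(d, local_assumptions) in projected for d in local_set)

# Hypothetical example with a single global diagnosis.
global_set = [{"a1": "normal", "a2": "stick"}]
initial = [{"a1": "normal"}, {"a1": "stick"}]
pruned = [{"a1": "normal"}]
```

In this example the initial local set is conservative but not feasible, while the pruned set is both, which is the direction the refinement is intended to move.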

Incremental  Consistency 

We  next  discuss  an  algorithmic  framework  for  incrementally 
revising a set of conservative diagnoses into a set of feasible
diagnoses in a robust, anytime, distributed manner, followed
by  results  from  one  particular  instantiation  of  this  frame¬ 
work.  The  approach  of  the  algorithmic  framework  is  similar 
in  spirit  to  Waltz’s  algorithm  (Waltz  1975).  Each  set  of  di¬ 
agnoses  is  monotonically  reduced  toward  a  feasible  set  as  a 
side  effect  of  spreading  consensus  on  the  value  of  variables 
shared  between  diagnosers.  The  algorithm  is  illustrated  in 
Figure  7. 

The algorithm operates by incrementally reducing the
possible assignments to $A_L$ for all $L$, first by introduction
of observations and second by communication between
diagnosers. Each local diagnoser begins with a conservative
local diagnosis set in $A_L$. Typically this would be all
possible diagnoses, which can be implicitly captured by an
appropriate encoding of the constraint set $F_L$. In Step 1,
observations are assigned in every diagnoser which has constraints
involving an observation. In Step 2, the observation
assignments are used to compute all assignments to $A_L \cup R_L$ that
are consistent with $F_L$ and the observations received by $L$.
Note that the projection of $A_L$ from these assignments is a


conservative diagnosis set. Intuitively, suppose an
assignment to $A_L$ appears in a global diagnosis but is not
computed by $L$. If it is not computed, it must be inconsistent with
$F_L$ and the assignments to $O_L$. It is therefore inconsistent
with $F$ and the assignments to $O$, and could not appear in a
global diagnosis. In Step 3, the assignments to $R_L$ are
projected out of the consistent assignments of $L$ and forwarded
to each other diagnoser $M$ that references these variables. In
Step 4, $M$ eliminates a subset of its assignments that are not
feasible. Intuitively, an assignment $\alpha$ to $A_M$ is not
feasible if there is no assignment to $A$ containing $\alpha$ that is
consistent with $F$ and $O$. If $\alpha$ constrains a variable in $R_L$ to
have a value that was not received from $L$, then $\alpha$ is
inconsistent with all consistent assignments to $A_L$. Thus, each
time Step 4 is performed, infeasible assignments to $A_M$ are
eliminated. Each diagnoser begins with a conservative set
of assignments to $A_L$, and as rounds of communication are
performed, the local diagnoses are moved toward feasibility
in an anytime manner. Per Step 5, the algorithm continues as
long as consistent assignments are eliminated. In the worst
case, each loop would eliminate one of an exponential
number of possible assignments.
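
The refinement loop of Figure 7 can be sketched in Python as follows. This is a simplified, single-process illustration under stated assumptions rather than the authors' implementation: assignments are enumerated by brute force, every diagnoser solves locally even if it has no observations, each exchange is projected onto the variables shared by the pair, and the class, function and variable names (`LocalDiagnoser`, `refine`, `mid`, `out`) are hypothetical.

```python
from itertools import product

class LocalDiagnoser:
    def __init__(self, name, domains, constraint):
        self.name = name
        self.domains = domains            # variable -> iterable of values
        self.variables = sorted(domains)
        self.constraint = constraint      # callable: assignment dict -> bool
        self.assignments = []

    def solve_locally(self, observations):
        """Steps 1-2: fix local observations, keep consistent assignments."""
        self.assignments = []
        for values in product(*(self.domains[v] for v in self.variables)):
            a = dict(zip(self.variables, values))
            observed_ok = all(a[o] == val for o, val in observations.items()
                              if o in a)
            if observed_ok and self.constraint(a):
                self.assignments.append(a)

    def projection(self, shared):
        """Step 3: assignments to the shared variables that remain."""
        return {tuple(a[v] for v in shared) for a in self.assignments}

    def prune(self, shared, received):
        """Step 4: drop assignments whose shared values were not received."""
        before = len(self.assignments)
        self.assignments = [a for a in self.assignments
                            if tuple(a[v] for v in shared) in received]
        return len(self.assignments) < before

def refine(diagnosers, observations):
    for d in diagnosers:
        d.solve_locally(observations)
    changed = True
    while changed:                        # Step 5: repeat while sets shrink
        changed = False
        for L in diagnosers:
            for M in diagnosers:
                shared = sorted(set(L.variables) & set(M.variables))
                if M is not L and shared:
                    if M.prune(shared, L.projection(shared)):
                        changed = True
    return diagnosers

FLOW = ("zero", "nonzero")
MODE = ("normal", "stick")
upstream = LocalDiagnoser(
    "upstream", {"a1": MODE, "mid": FLOW},
    lambda a: a["mid"] == ("nonzero" if a["a1"] == "normal" else "zero"))
downstream = LocalDiagnoser(
    "downstream", {"a2": MODE, "mid": FLOW, "out": FLOW},
    lambda a: a["out"] == ("nonzero" if a["mid"] == "nonzero"
                           and a["a2"] == "normal" else "zero"))
refine([upstream, downstream], {"out": "nonzero"})
```

The upstream diagnoser has no sensor of its own, so it starts with both assumption values. After one exchange over the shared variable `mid`, only the normal assumption remains. Because each round can only remove assignments, stopping the loop early still leaves a conservative set.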

Note  that  we  have  described  the  algorithm  to  propagate 
sets of assignments that remain consistent in one local
diagnoser to other diagnosers in which the assigned
variables appear. More generally, we may propagate any
information that allows remote diagnosers to restrict the
domain of a variable based upon inference performed in the
local  diagnoser.  Examples  include  assignments  that  cannot 
be  made  because  of  constraints  within  one  diagnoser  (no¬ 
goods),  assignments  that  must  be  made,  or  sets  of  possible 
assignments  to  a  variable  that  remain  consistent.  Note  also 
that  this  algorithm  is  not  complete  with  respect  to  distributed 
constraint satisfaction. Intuitively, suppose we have two
local diagnosers, one containing only the constraint $A \vee B$ and
the other containing only the constraint $\neg A \vee B$. Neither can
constrain and propagate the value of $B$, though $B$ must be
true. This same restriction applies to the centralized
constraint satisfaction technique used in L2, so we do not
believe it presents a significant drawback. The related work
section  contains  further  details  on  the  relationship  between 
distributed  diagnosis  and  distributed  constraint  satisfaction 
and  why  we  believe  an  incomplete  algorithm  is  sufficient. 
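
The incompleteness example can be reproduced in a few lines of Python. The sketch reads the two constraints as A ∨ B and ¬A ∨ B and assumes the diagnosers exchange only per-variable domains, as with unit propagation; both readings are assumptions, and the helper names `solutions` and `domains` are hypothetical.

```python
from itertools import product

def solutions(constraint):
    """All (A, B) truth assignments satisfying a local constraint."""
    return [(a, b) for a, b in product((False, True), repeat=2)
            if constraint(a, b)]

def domains(sols):
    """Per-variable domains, which is all that is exchanged in this sketch."""
    return {a for a, _ in sols}, {b for _, b in sols}

left = solutions(lambda a, b: a or b)           # A or B
right = solutions(lambda a, b: (not a) or b)    # (not A) or B

left_domains = domains(left)
right_domains = domains(right)

# A complete solver would intersect the joint assignments instead.
joint = [s for s in left if s in right]
```

Each diagnoser on its own still allows both values for A and for B, so no domain restriction can be sent. Only the joint solutions reveal that B is true in every case.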

Communication  Requirements 

When  presented  with  a  networked,  embedded  system,  we 
may  perform  centralized  diagnosis  of  the  distributed  system 
by  transmission  of  observations  or  distributed  diagnosis  of 
the  distributed  system  by  transmission  of  intermediate  re¬ 
sults.  Choosing  distributed  diagnosis  allows  us  to  trade  com¬ 
munication  bandwidth  for  reduced  processor  requirements, 
increased  robustness  and  greater  reconfigurability.  In  this 
section,  we  examine  how  the  communication  requirements 
of  the  distributed,  incremental  diagnosis  algorithm  compare 
to  a  centralized  approach.  We  first  consider  the  communi¬ 
cation  requirements  of  the  centralized  procedure  shown  in 
Figure  3.  Let  n  be  the  number  of  components  and  s  be  the 
number  of  components  with  sensors.  In  Step  3  of  the  pro¬ 
cedure,  each  of  s  components  forwards  its  observations  to 


C. In Step 5, C forwards the diagnostic results to each of n
components. Assuming all observations from a single
component can be sent in a single message, Figure 3 requires
s point to point messages to C and one broadcast message
from C to all n components.

We  now  consider  the  communication  requirements  for  the 
distributed  algorithm  of  Figure  7.  This  algorithm  performs 
distributed  diagnosis  by  exchanging  messages  that  refine  the 
value of shared variables across local diagnosers. Let $v$ be
the number of variables that are shared, $r$ be the average
number of diagnosers that share each variable, and
$m$ be the average number of messages exchanged that
involve a given variable. For example, if each local diagnoser
uses unit propagation, it can send messages specifying that
a variable must have a certain value or cannot have a certain
value, but no messages specifying disjunctions between
assignments. Thus $m$ is bounded by the size of the largest
domain of a shared variable. The increase in messages created
by moving to the distributed diagnosis technique is given by
the ratio

\[ \alpha_1 = \frac{v\,r\,m}{s + 1} \]

Note  that  counting  the  number  of  messages  exchanged  is 
not  sufficient  to  determine  the  cost  of  communication.  In 
many  applications,  such  as  wireless  networks  with  limited 
energy or bandwidth, the number of packets transmitted is a
critical  cost  measure.  Network  topology  will  determine  the 
number  of  packet  transmissions  or  hops  required  to  deliver 
a  message.  In  many  applications,  each  node  in  a  network 
is  connected  to  a  small  number  of  neighbors.  Point  to  point 
communication  is  implemented  by  multiple  hops  between 
neighbors,  and  a  broadcast  is  implemented  by  flooding  the 
network.  Let  hc  be  the  average  distance  in  hops  between  a 
node  with  a  sensor  and  the  centralized  diagnoser.  Let  hv  be 
the  average  number  of  hops  between  nodes  that  share  a  vari¬ 
able.  In  general,  the  change  in  the  total  number  of  packet 
transmissions  required  by  decentralizing  the  problem  is  de¬ 
termined  by 

$$\alpha_2 = \frac{v\,r\,m\,h_v}{s\,h_c + n}$$
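As a rough illustration of how the two ratios behave, the following sketch evaluates them for one made-up network. The function names and every parameter value are assumptions chosen for the example; they are not taken from the paper path model or from any measured system.

```python
def message_ratio(v, r, m, s):
    """alpha_1: distributed messages (v*r*m) over centralized messages (s + 1)."""
    return (v * r * m) / (s + 1)


def packet_ratio(v, r, m, h_v, s, h_c, n):
    """alpha_2: distributed packet transmissions over centralized ones.

    The centralized cost is s point-to-point messages of h_c hops each,
    plus a flooded broadcast counted as n transmissions.
    """
    return (v * r * m * h_v) / (s * h_c + n)


# Hypothetical network: 20 shared variables, each shared by 2 diagnosers,
# 3 messages per variable, 10 sensor nodes, 40 nodes in total.
params = dict(v=20, r=2, m=3, s=10)
alpha_1 = message_ratio(**params)
alpha_2 = packet_ratio(h_v=1, h_c=6, n=40, **params)

print(f"alpha_1 = {alpha_1:.2f}, alpha_2 = {alpha_2:.2f}")
```

With these illustrative numbers the distributed scheme sends roughly eleven times as many messages, but only slightly more packets, because each of its messages travels a single hop while centralized observations travel six.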

Intuitively,  packet  transmission  for  the  centralized  diagnoser 
scales  with  the  size  and  width  of  the  network,  while  the  de¬ 
centralized  approach  scales  with  the  number  of  constraints 
that  cross  network  components.  Note  that  if  the  network 
topology reflects the physical interactions of the components,
it is likely the case that hv ≪ hc. Thus we can
construct  wide  networks  with  very  localized  interactions  for 
which  centralized  diagnosis  requires  more  packet  transmis¬ 
sions  than  decentralized  diagnosis,  though  we  do  not  expect 
this  to  be  the  case  in  practice.  In  addition  to  total  packet 
transmission,  we  may  further  refine  our  cost  measure  to  in¬ 
clude  the  maximum  number  of  packets  transmitted  by  any 
link in the network. This determines the minimum bandwidth or
power storage a network node must support. The ratio α2 does not
capture that, in the centralized case, all messages must pass
through network links connected to the central diagnoser. This
drives up the minimum capabilities of a network node relative to
distributed diagnosis, where message sources and destinations are
more evenly distributed through the system. We are currently
defining a diagnostic model for a distributed sensor network, in
addition to available models of more traditional
electro-mechanical systems, in order to better characterize the
communication requirements of both distributed and centralized
algorithms.

Independent     |       L2       |      Distributed
Faults In       | Diag    Time   | Spread   Diag    Time
----------------+----------------+------------------------
First module    |    6    0.02   |    9       21    0
Two modules     |   12    0.18   |   14       49    0
Three modules   |   84   13.28   |   20      343    0.05
All modules     |  108   27.08   |   24      637    0.22

Table 1: Comparison of distributed diagnoser and L2


Results 

To  implement  the  distributed  diagnosis  algorithm  described 
above,  each  local  diagnoser  could  represent  its  conservative 
diagnosis  set  as  a  partial  assignment  in  a  GDE-style  diag¬ 
noser,  a  relational  table,  a  binary  decision  diagram  and  so 
on,  so  long  as  the  representation  can  be  efficiently  pruned 
when  an  observation  or  neighboring  diagnoser  decreases  the 
range  of  a  variable.  Ideally,  we  would  like  to  test  a  central¬ 
ized  diagnoser  against  a  set  of  local  diagnosers  that  compute 
and  represent  diagnoses  in  the  same  manner.  For  these  pre¬ 
liminary  results,  we  present  the  performance  of  the  central¬ 
ized  L2  diagnoser  against  a  distributed  diagnoser  that  takes 
advantage of the small local model size enabled by distributing
the problem. PARC intern Rong Su implemented the distributed
algorithm using finite-state automata (FSA) to prune inconsistent
assignments to V_L (Steps 2 and 4 of Figure 7) and a distributed
consensus algorithm (Steps 3 and 5) shown to converge to feasible
diagnoses (Su et al. 2002). Table 1 compares performance with L2
on the paper path model. The first three columns are the name of
the diagnostic scenario, the number of diagnoses found by L2, and
the time required.
Since  the  physical  plant  has  few  sensors,  the  number  of  con¬ 
sistent  diagnoses  grows  with  the  complexity  of  the  scenario. 
The  fourth  column  is  the  number  of  local  diagnosers  reached 
via  Step  3  of  the  algorithm,  out  of  24.  The  fifth  column 
lists  the  number  of  diagnoses  found  by  the  distributed  al¬ 
gorithm.  Note  that  the  FSA-based  algorithm  finds  more  di¬ 
agnoses  than  L2.  L2  is  conflict  based,  and  thus  postulates 
only  those  failures  that  can  eliminate  a  discrepancy  between 
an  expected  observation  and  the  observation  received  from 
the  plant.  The  FSA-based  algorithm  finds  all  consistent  fail¬ 
ures,  including  those  that  would  be  indistinguishable  from 
proper  operation  of  the  plant.  The  sixth  column  is  the  time 
to  compute  the  diagnoses,  demonstrating  the  dramatic  speed 
advantage, on this model, of computing feasible local diagnoses
via a pre-compiled FSA representation and then determining
consistent combinations, versus global, on-line inference.
The  current  implementation  runs  each  local  diagnoser  seri¬ 
ally  on  a  single  processor,  and  we  believe  a  parallel  imple¬ 
mentation  will  provide  a  greater  speed  advantage. 
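The pruning requirement stated at the start of this section can be made concrete with the relational-table option. The sketch below is only an illustration of that idea, not the FSA-based implementation evaluated in Table 1; the class name TableDiagnoser, its methods, and the toy valve model are all hypothetical.

```python
from typing import Dict, Iterable, List, Set

Row = Dict[str, str]


class TableDiagnoser:
    """Local diagnoser that stores its candidate diagnoses as explicit rows."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self.rows: List[Row] = [dict(r) for r in rows]

    def restrict(self, var: str, allowed: Set[str]) -> bool:
        """Drop rows whose value for `var` is outside `allowed`.

        Returns True if any row was removed, i.e. if neighbors sharing
        other variables may need to be told about a smaller range.
        """
        kept = [r for r in self.rows if r[var] in allowed]
        changed = len(kept) < len(self.rows)
        self.rows = kept
        return changed

    def range_of(self, var: str) -> Set[str]:
        """Values of `var` that still appear in some remaining row."""
        return {r[var] for r in self.rows}


# Toy component: a valve that passes flow when ok and blocks it when stuck.
valve = TableDiagnoser([
    {"mode": "ok", "flow_in": "high", "flow_out": "high"},
    {"mode": "ok", "flow_in": "low", "flow_out": "low"},
    {"mode": "stuck", "flow_in": "high", "flow_out": "low"},
    {"mode": "stuck", "flow_in": "low", "flow_out": "low"},
])
valve.restrict("flow_in", {"high"})   # a neighbor narrows a shared variable
valve.restrict("flow_out", {"low"})   # a local observation arrives
print(valve.range_of("mode"))
```

Because rows are only ever removed, stopping at any point leaves a superset of the consistent local diagnoses, which matches the conservative behavior the framework asks for.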


Related  Work 

A  diagnoser  for  a  networked,  embedded  system  may  be  cen¬ 
tralized,  decentralized  or  distributed.  Work  in  centralized 
diagnosis  may  be  applied  by  collecting  models  and  observa¬ 
tions  from  the  networked  components  of  the  physical  plant 
and applying a centralized algorithm. As described in the
third  section  of  this  paper,  this  raises  robustness  and  scalabil¬ 
ity  issues  that  must  be  addressed.  Rish,  Brodie  and  Ma,  for 
example,  attempt  to  increase  the  efficiency  of  a  centralized 
diagnostic  procedure  for  a  distributed  network  of  computers 
using  an  approximate  representation  and  carefully  designed 
active  probing  of  the  distributed  system  (Rish,  Brodie,  &  Ma 
2002). In decentralized diagnosis, e.g. (Debouk, Lafortune, &
Teneketzis 2000), local diagnosers communicate with a
coordination process that assembles a global diagnosis. The
coordination process of a decentralized approach is still subject
to robustness and scalability issues. We are therefore pursuing
an approach of distributed diagnosis, similar to (Baroni et al.
1999), where there is no centralized control structure or
coordination process. Each local diagnoser communicates directly
with other diagnosers.

We have formulated the distributed diagnostic process
as a distributed constraint satisfaction problem (DCSP).
Since  many  problems  in  scheduling,  resource  allocation,  and 
hardware  design  can  be  formulated  as  constraint  satisfaction 
problems,  the  distributed  constraint  satisfaction  problem  has 
received  a  large  amount  of  attention.  Yokoo  and  Hirayama 
provide  an  excellent  overview  (Yokoo  &  Hirayama  2000)  of 
algorithms for solving DCSPs. These existing algorithms
do  not  meet  our  needs  for  two  reasons.  First,  the  great  ma¬ 
jority  of  the  algorithms  are  formulated  assuming  the  com¬ 
putational  nodes  and  network  connecting  the  nodes  are  re¬ 
liable,  and  that  all  messages  sent  between  nodes  arrive  in 
the  order  sent.  For  diagnosis  of  networked,  embedded  sys¬ 
tems,  we  seek  specific  guarantees  of  behavior  in  response  to 
the  loss  of  computing  nodes  or  bifurcation  of  the  network. 
Second,  the  majority  of  DCSP  algorithms  are  designed  to 
solve  general  discrete  constraint  satisfaction  problems,  such 
as  the  graph  coloring  problem.  The  ability  to  solve  general 
CSP  problems  requires  features  that  complicate  distribution, 
such  as  backtracking  on  choices  for  variable  assignments. 
In  practice,  centralized  diagnosers  are  able  to  find  consis¬ 
tent  diagnoses  using  incomplete,  backtrack-free  procedures 
such  as  unit  propagation.  This  difference  arises  because  the 
constraints we generate from finite state models, such as those
illustrated in Figure 4, tend to be closer to Horn clauses in
structure than general discrete constraints, and because diagnosis
may use observation values asserted by the physical plant to drive
constraint  processing.  We  therefore  expect  a  distributed  di¬ 
agnoser  acting  upon  the  same  models  should  be  able  to  use 
less  powerful  inference  methods  than  full  constraint  satis¬ 
faction.  While  we  have  encountered  full  DCSP  algorithms 
that  allow  some  fault  tolerance,  such  as  the  Mozart  system 
(Roy  1999),  and  some  simpler  constraint  processing  meth¬ 
ods  that  assume  reliable,  fully  connected  networks,  such  as 
distributed  arc  consistency  (Nguyen  &  Deville  1998),  we 
have  not  yet  encountered  an  algorithm  that  is  sufficiently 
narrow  in  scope  and  robust  to  failures. 


Future  Work 

A  number  of  issues  remain  for  future  work.  The  issue  of 
how  to  use  knowledge  of  the  prior  probability  of  failures  to 
avoid  computing  all  consistent  diagnoses  has  been  explored 
but  not  solved.  The  algorithm  of  Figure  7  also  does  not  take 
into  account  any  information  about  the  likelihood  of  fail¬ 
ures.  We  may  of  course  find  the  set  of  globally  consistent 
diagnoses  and  compute  the  probability  of  each  by  assuming 
conditional  independence  of  the  failures,  as  described  above. 
However, rather than computing the probabilities of all
consistent diagnoses, we might wish to avoid generating unlikely
diagnoses once we have generated a sufficient number of
consistent, likely diagnoses. Conflict-directed, best-first
search (de Kleer & Williams 1989) is a centralized, discrete
constraint optimization algorithm that is specialized for
diagnosis.  It  efficiently  enumerates  consistent  assignments 
to  a  set  of  propositional  variables  in  order  of  their  cost,  or  in 
this  case  enumerates  diagnoses  in  order  of  their  prior  prob¬ 
ability.  Intuitively,  it  operates  by  starting  with  the  highest 
probability  assignment  to  the  assumptions,  the  case  where 
no  failures  have  occurred.  It  substitutes  a  minimal  cost  as¬ 
signment  to  an  assumption  with  a  non-minimal  cost  assign¬ 
ment  only  when  a  conflict  between  an  observation  value  as¬ 
signed  by  the  plant  and  the  value  predicted  by  the  current 
assumption  assignments  occurs.  Our  current  direction  in 
developing a distributed analog is to begin with a maximum
likelihood (e.g., no failure) assignment to A_L within each
diagnoser L, which in turn constrains the shared variables.
When diagnosers L and M disagree on the value of a
shared  variable  r,  each  performs  a  local  diagnosis  to  conser¬ 
vatively  approximate  the  maximum  probability  assignment 
to  the  assumptions  that  would  admit  a  different  value  for 
r.  This  information  can  then  be  used  to  limit  propagation 
of  variable  changes  throughout  the  system.  We  have  imple¬ 
mented  a  preliminary  version  of  this  system  using  copies  of 
L2  as  the  local  diagnosers  for  the  purposes  of  exploration, 
but  we  are  currently  limited  to  very  simple  network  topolo¬ 
gies.  Formalizing  a  reasonably  general  algorithm  for  gener¬ 
ating  a  conservative  estimate  of  the  most  likely  diagnoses  in 
a  robust,  distributed,  anytime  manner  remains  future  work. 
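To illustrate what enumerating diagnoses in order of prior probability means, the sketch below performs a plain best-first search over failure sets, assuming independent failures. It is a deliberate simplification: it is centralized, it is not conflict-directed (it tests every candidate rather than using conflicts to skip them), and it is not the distributed analog under development. The function name, the component names, and the failure probabilities are hypothetical.

```python
import heapq
from itertools import islice
from typing import Callable, Dict, FrozenSet, Iterator, Tuple


def diagnoses_by_prior(
    fail_prob: Dict[str, float],
    consistent: Callable[[FrozenSet[str]], bool],
) -> Iterator[Tuple[FrozenSet[str], float]]:
    """Yield consistent failure sets in order of non-increasing prior."""
    names = sorted(fail_prob)
    # With every failure probability below 0.5, adding a failure always
    # lowers the prior, so a max-heap pops candidates in sorted order.
    assert all(0.0 < fail_prob[c] < 0.5 for c in names)
    p_none = 1.0
    for c in names:
        p_none *= 1.0 - fail_prob[c]
    # Heap entries: (negated prior, index of last component added, failures).
    heap = [(-p_none, -1, ())]
    while heap:
        neg_p, last, failed = heapq.heappop(heap)
        if consistent(frozenset(failed)):
            yield frozenset(failed), -neg_p
        # Extend only with later components so each set is generated once.
        for i in range(last + 1, len(names)):
            c = names[i]
            odds = fail_prob[c] / (1.0 - fail_prob[c])
            heapq.heappush(heap, (neg_p * odds, i, failed + (c,)))


fail_prob = {"motor": 0.01, "sensor": 0.05, "belt": 0.02}


def explains(failed: FrozenSet[str]) -> bool:
    """Stand-in consistency test: the symptom needs a motor or belt fault."""
    return "motor" in failed or "belt" in failed


first_three = list(islice(diagnoses_by_prior(fail_prob, explains), 3))
for failed, prior in first_three:
    print(sorted(failed), round(prior, 5))
```

Stopping after a fixed number of results, as islice does here, corresponds to declining to generate unlikely diagnoses once enough likely ones are in hand.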

As  framed  here,  the  distributed  diagnoser  never  computes 
complete  global  diagnoses.  Rather,  at  each  local  diagnoser 
it  computes  feasible  local  diagnoses.  These  are  projections 
of  the  global  diagnoses  that  are  relevant  to  that  diagnoser.  In 
the  case  that  control  of  the  plant  is  distributed,  we  believe 
this  is  appropriate.  Each  processing  node  uses  the  possible 
states  of  its  components,  as  determined  by  the  feasible  local 
diagnoses,  to  inform  its  control.  However,  even  when  per¬ 
forming  distributed  diagnosis  of  a  distributed  system,  com¬ 
putation  of  the  global  diagnoses  may  be  of  interest  for  pur¬ 
poses  such  as  centralized,  supervisory  control  or  display  to  a 
user.  We  note  that  simply  taking  the  cross-product  of  the  fea¬ 
sible  diagnoses  produced  by  each  local  diagnoser  will  result 
in  a  superset  of  the  global  diagnoses.  Some  combinations 
of  the  cross-product  may  not  appear  in  any  consistent  global 
diagnosis.  If  the  consistent  global  diagnoses  are  needed,  we 
may  compute  them  by  checking  combinations  of  local  feasi¬ 
ble  diagnoses  from  multiple  diagnosers  against  a  combined 
model using a linear-time technique such as unit propagation.
This can be done hierarchically and in parallel, allowing us to
rule out inconsistent partial combinations of local
diagnoses  in  order  to  avoid  explicitly  checking  all  combina¬ 
tions.  Intuitively  and  from  initial  experiments,  we  suspect 
for  many  problems  this  technique  would  be  a  competitive 
method  for  producing  all  consistent  global  diagnoses.  In 
fact, the performance numbers for the FSA-based distributed
algorithm  shown  in  Table  1  are  for  both  computing  the  con¬ 
servative  and  feasible  local  diagnoses  for  each  local  diag¬ 
noser  and  then  computing  the  globally  consistent  combina¬ 
tions  of  these  local  diagnoses.  Formalizing  this  technique 
and  more  thoroughly  investigating  its  effectiveness  remain 
future  work. 
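The combination step can be sketched as a pairwise join that discards inconsistent partial combinations before the next diagnoser is considered. As a simplifying assumption, the check below only requires agreement on shared variables, where the text describes checking against a combined model with unit propagation. The function names and the three toy diagnosers are hypothetical.

```python
from functools import reduce
from typing import Dict, List

Diagnosis = Dict[str, str]


def join(left: List[Diagnosis], right: List[Diagnosis]) -> List[Diagnosis]:
    """Merge every pair of local diagnoses that agree on shared variables."""
    joined = []
    for a in left:
        for b in right:
            shared = a.keys() & b.keys()
            if all(a[k] == b[k] for k in shared):
                joined.append({**a, **b})
    return joined


def global_candidates(local_sets: List[List[Diagnosis]]) -> List[Diagnosis]:
    """Fold the join over all diagnosers, pruning partial combinations."""
    return reduce(join, local_sets)


# Three local diagnosers; x is shared by L and M, y by M and N.
local_L = [{"mode_A": "ok", "x": "1"}, {"mode_A": "failed", "x": "0"}]
local_M = [
    {"x": "1", "mode_B": "ok", "y": "1"},
    {"x": "0", "mode_B": "ok", "y": "0"},
    {"x": "1", "mode_B": "failed", "y": "0"},
]
local_N = [{"y": "0", "mode_C": "ok"}]

candidates = global_candidates([local_L, local_M, local_N])
cross_product_size = len(local_L) * len(local_M) * len(local_N)
print(len(candidates), "of", cross_product_size, "combinations survive")
```

The left-to-right fold is one possible ordering. Joining disjoint pairs of diagnosers independently and then joining the results would give the hierarchical, parallel variant mentioned above.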

Conclusion 

We  have  developed  a  distributed  diagnosis  framework  that 
leverages  the  topology  of  the  physical  plant  to  limit  inter- 
diagnoser  communication  and  compute  consistent  diagnoses 
in  an  anytime  and  any  information  manner,  making  it  ro¬ 
bust  to  communication  and  processor  failures.  The  frame¬ 
work  is  conservative,  in  that  it  avoids  false  negatives  in  fa¬ 
vor  of  false  positives  in  the  case  where  computation  cannot 
be  completed  due  to  limited  time  or  communication  failure. 
This  property  can  be  vital  in  applications  where  safety  is 
critical.  In  addition  to  being  anytime  and  conservative,  our 
approach  allows  a  very  small  granularity  for  the  local  di¬ 
agnosers.  We  can  potentially  create  a  diagnoser  per  physi¬ 
cal  component  if  desired.  This  flexibility  allows  us  to  con¬ 
sider  time/space/communication  tradeoffs  that  implement 
each  local  diagnoser  as  an  exponentially  large  (in  the  small 
local  model  size)  structure  that  enables  diagnosis  to  be  per¬ 
formed  collaboratively  on  very  weak  networked  processors. 
One version of the distributed algorithm for finding consistent
diagnoses has been implemented using a discrete-event
formulation and tested on one model. Our future work
includes  implementations  of  the  algorithm  using  binary  de¬ 
cision  diagrams  and  the  unit  propagation  implementation  of 
L2  to  compute  locally  consistent  assignments.  The  latter 
will  allow  direct  comparison  of  centralized  and  distributed 
implementations  of  the  same  diagnostic  technique  on  a  va¬ 
riety  of  problems  modeled  for  L2.  We  are  also  continuing 
to  extend  the  formulation  to  include  optimization-based  dis¬ 
tributed  diagnosis. 